Need help with lxml.html and XPath
data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt/following-sibling::dd")

This needs more work. This area is
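from lxml import html

# Parse a saved copy of the page
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
# xpath() returns a list of <dd> elements, so query each one in turn
Services_Product = tree.xpath("//dt/following-sibling::dd")
for dd in Services_Product:
    # './/li' is relative to this <dd>; '//li' would search the whole document
    for elem in dd.xpath(".//li"):
        print(elem.text)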
lxml can be used as the parser for BS (BeautifulSoup); I always do it that way.
soup = BeautifulSoup(response.content, 'lxml')
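
A minimal sketch of how that could look for the dt/dd layout from the question, assuming bs4 and lxml are both installed. The HTML string and the names sample and items are made up for illustration, since the real page is not shown in the thread.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page
sample = """
<dl>
  <dt>Services/Products</dt>
  <dd>
    <ul>
      <li>Item 1</li>
      <li><strong>Item 2</strong> details</li>
    </ul>
  </dd>
</dl>
"""

# 'lxml' tells BeautifulSoup to use lxml as the underlying parser
soup = BeautifulSoup(sample, 'lxml')

# 'dt ~ dd li' selects every <li> inside a <dd> that follows a <dt> sibling,
# roughly the CSS counterpart of //dt/following-sibling::dd//li
items = [li.get_text(" ", strip=True) for li in soup.select("dt ~ dd li")]
print(items)
```

get_text() collects the text of nested tags too, so an item that starts with <strong> is not lost.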
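from lxml import html
import requests

# url is the address of the page being scraped (not shown in the thread)
response = requests.get(url)
tree = html.fromstring(response.content)
# xpath() returns a list of matching <ul> elements, not a single element
prod = tree.xpath('//*[@id="business-info"]/dl/dd/ul')
for ul in prod:
    for tag in ul:  # iterating an element yields its child elements
        print(tag.text)

When the content mixes different tags and you want the text out of it, it is easier to run html2text on the part you have already found.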
import html2text
data = '''\
<dd>
A block of text here.... bla bla bla....
<ul>
<li><p>Item 1.for some reason they wraped this in a p</p></li>
<li><strong>And this item is important</strong>bla bla bla</li>
<li>And just more info here...</li>
</ul>
And finally more stuff here...
</dd>'''
# Keep the converter and its output under separate names
converter = html2text.HTML2Text()
converter.mark_code = True
converter.ignore_emphasis = True
converter.single_line_break = True
converter.ignore_links = True
text = converter.handle(data)
print(text.strip())
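
If you would rather stay inside lxml, here is a rough sketch of another option: text_content() returns the text of an element together with all of its descendants, whereas .text only gives the text before the first child tag. The fragment below is a made-up, shortened version of the sample above, and the names fragment and item_texts are only for illustration.

```python
from lxml import html

# Made-up fragment loosely mirroring the sample above
fragment = html.fromstring(
    "<dl><dt>Services</dt>"
    "<dd>A block of text"
    "<ul>"
    "<li><p>Item 1 wrapped in a p</p></li>"
    "<li><strong>Important</strong> bla bla</li>"
    "<li>Plain item</li>"
    "</ul>"
    "More text</dd></dl>"
)

for li in fragment.xpath(".//li"):
    # li.text is None when the item starts with a child tag such as <p>,
    # while text_content() also includes the text of nested tags
    print(repr(li.text), "->", li.text_content().strip())

item_texts = [li.text_content().strip() for li in fragment.xpath(".//li")]
```

This gives plain strings without any Markdown formatting, so html2text is still the better fit when you want list markers and line breaks preserved.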