添方夜弹 posted on 2021-2-23 09:42:38

Need help with lxml.html and xpath

data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt/following-sibling::dd")

This needs more work. The field is ...
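from lxml import html

# Read the saved page; the context manager closes the file afterwards
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
# xpath() returns a list of <dd> elements, so query each one in turn
Services_Product = tree.xpath("//dt/following-sibling::dd")
for dd in Services_Product:
    # The leading "." keeps the search inside this <dd>
    for elem in dd.xpath(".//li"):
        # text_content() also picks up text inside nested tags such as <p>
        print(elem.text_content())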

远方的树 posted on 2021-2-25 10:36:41

lxml can be used as the parser for BeautifulSoup (BS); I always do it that way.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
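# Or use lxml directly, without BeautifulSoup:
from lxml import html
import requests

# `url` must be set to the address of the page being scraped
response = requests.get(url)
tree = html.fromstring(response.content)
# xpath() returns a list of matching <ul> elements, not a single element
prod = tree.xpath('//*[@id="business-info"]/dl/dd/ul')
for ul in prod:
    for tag in ul:  # iterating an element yields its direct children
        print(tag.text)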
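
To see why the loop has to run over an element rather than over the list that xpath() returns, here is a small self-contained sketch. The HTML string is made up to resemble the structure that the XPath above implies, and the helper name list_items is only for illustration; the real page may well look different.

```python
from lxml import html

# Hypothetical markup, shaped to match the XPath used above; the real page may differ
SAMPLE = """
<div id="business-info">
  <dl>
    <dt>Services/Products</dt>
    <dd>
      <ul>
        <li>Item 1</li>
        <li><strong>Item 2</strong> with a tail</li>
      </ul>
    </dd>
  </dl>
</div>
"""


def list_items(source):
    """Return the text of every <li> under the business-info <ul> elements."""
    tree = html.fromstring(source)
    items = []
    # xpath() always returns a list, even when only one <ul> matches
    for ul in tree.xpath('//*[@id="business-info"]/dl/dd/ul'):
        for li in ul:  # iterating an element yields its direct children
            # text_content() joins the text of nested tags; .text alone would miss it
            items.append(li.text_content().strip())
    return items


if __name__ == "__main__":
    for item in list_items(SAMPLE):
        print(item)
```

Using text_content() instead of .text is a choice made for this sketch: .text only holds the text before the first child tag, so it would be None for an item that starts with <strong>.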

蓝精灵童鞋 posted on 2021-3-16 15:36:46

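When the content mixes different tags and you want the text from all of them, it is easier to run html2text on the part you have already found.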
import html2text

data = '''\
<dd>
A block of text here.... bla bla bla....
<ul>
    <li><p>Item 1.for some reason they wraped this in a p</p></li>
    <li><strong>And this item is important</strong>bla bla bla</li>
    <li>And just more info here...</li>
</ul>
And finally more stuff here...
</dd>'''

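# Configure the converter, then turn the HTML fragment into plain text
converter = html2text.HTML2Text()
converter.mark_code = True
converter.ignore_emphasis = True  # drop the emphasis markers around <strong> text
converter.single_line_break = True
converter.ignore_links = True
text = converter.handle(data)
print(text.strip())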
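
If installing html2text is not an option, something roughly similar can be done with lxml alone. Note that this is a different technique and not what html2text does: it walks the text nodes with itertext() and drops the blank ones, so list bullets and any other Markdown formatting are lost. The helper name text_lines and the shortened sample fragment are made up for this sketch.

```python
from lxml import html


def text_lines(fragment):
    """Return the non-empty text pieces of an HTML fragment, in document order."""
    element = html.fromstring(fragment)
    # itertext() yields every text and tail string under the element, in order
    pieces = (piece.strip() for piece in element.itertext())
    return [piece for piece in pieces if piece]


# Shortened, hypothetical version of the <dd> fragment used above
data = '''\
<dd>
A block of text here.
<ul>
    <li><p>Item 1</p></li>
    <li><strong>Important</strong> and more</li>
    <li>Last item</li>
</ul>
And finally more stuff here.
</dd>'''

if __name__ == "__main__":
    print("\n".join(text_lines(data)))
```

One thing to watch for: the text inside <strong> and the text that follows it come back as two separate pieces, so they would need to be joined again if each list item should stay on one line.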