With BeautifulSoup 4 (lxml parser), how do I extract inner HTML from a tag (decode_contents not working)?

Question

I'm using BeautifulSoup 4 with Python 3.7. I want to extract the inner HTML of an article element I've found. I have this:

soup = BeautifulSoup(html, features="lxml")
...
article_elt = top_article_elt.select('div[class*="outer"]')[0]
article = article_elt.decode_contents()
...
print("article: " + str(article) + " score:" + str(score))

However, the printed output includes the outer tag:

article: <div class="outer"><p>Top story of the year.</p>
</div>

How do I write a statement that extracts only the inner HTML?

Answer 1

A quick workaround is to go one level deeper using .find():

article = article_elt.find().decode_contents()
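
To see why going one level deeper helps, here is a minimal sketch that reproduces the symptom. The markup is hypothetical (the question does not show the real page): it assumes a wrapper div whose class also contains the substring "outer". It uses html.parser so that it runs without lxml installed; features="lxml" should give the same result for this fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the asker's real page):
# the wrapper's class name also contains the substring "outer".
html = (
    '<div class="outer-wrapper">'
    '<div class="outer"><p>Top story of the year.</p>\n</div>'
    '</div>'
)

soup = BeautifulSoup(html, features="html.parser")

# The substring selector matches the wrapper first, so the wrapper's
# inner HTML is the nested div, outer tag included.
article_elt = soup.select('div[class*="outer"]')[0]
wrapped = article_elt.decode_contents()

# find() with no arguments returns the first tag beneath article_elt,
# here the nested div, whose contents are the actual article.
article = article_elt.find().decode_contents()

print("wrapped: " + wrapped)
print("article: " + article)
```

If this matches your situation, decode_contents() is working as intended and the selector is simply picking an element one level too high.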

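However, this may only treat the symptom rather than the cause. It looks as though you either have nested div elements with class="outer", or the class*="outer" check, which is a substring match, is matching an unexpected element higher up the tree. Try: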

article_elt = top_article_elt.select_one('div.outer')
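
The difference between the two selectors can be sketched as follows. The markup is again hypothetical (a wrapper with class="outer-wrapper" is an assumption, not something shown in the question), and html.parser is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a wrapper whose class contains "outer" as a substring.
html = (
    '<div class="outer-wrapper">'
    '<div class="outer"><p>Top story of the year.</p>\n</div>'
    '</div>'
)

soup = BeautifulSoup(html, features="html.parser")

# Attribute substring match: matches both the wrapper and the nested div.
substring_matches = soup.select('div[class*="outer"]')

# Class selector: matches only elements that have a class exactly equal to "outer".
article_elt = soup.select_one('div.outer')
article = article_elt.decode_contents()

print(len(substring_matches))
print("article: " + article)
```

Note that if the wrapper itself also had class="outer" (the nested case mentioned above), select_one('div.outer') would still return the wrapper, and a more specific selector would be needed.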

