Using BeautifulSoup to Extract CData

Question

I'm trying to use BeautifulSoup (bs4, Python 3) to extract CData. However, whenever I search for it with the code below, I get an empty result. Can anyone point out what I'm doing wrong?

from bs4 import BeautifulSoup,CData

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)

Answer 1

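The problem appears to be that the parser BeautifulSoup picks by default does not parse CDATA sections properly. If you specify a suitable parser explicitly, the CDATA shows up: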

soup = BeautifulSoup(txt,'html.parser')

For details on the available parsers, see the docs.
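Putting the question's snippet and the fix together, a minimal sketch could look like the following. The helper name `extract_cdata` is my own and not part of bs4, and `find_all(string=True)` is used as the newer spelling of `findAll(text=True)`; whether CDATA survives parsing may still depend on your Python and bs4 versions.

```python
from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''


def extract_cdata(markup, parser="html.parser"):
    """Return the contents of every CDATA section found in markup.

    extract_cdata is a hypothetical helper written for this answer.
    """
    # Name the parser explicitly instead of letting bs4 choose one.
    soup = BeautifulSoup(markup, parser)
    # string=True matches every text node; keep only the CData ones.
    return [str(node) for node in soup.find_all(string=True)
            if isinstance(node, CData)]


for contents in extract_cdata(txt):
    print('CData contents: %r' % contents)
```

Passing the parser name as an argument makes it easy to compare how different installed parsers treat the same markup.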

I worked this out by using the diagnose function, which the docs recommend:

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.

The diagnose() function prints how each available parser sees your HTML, which lets you choose the right parser for your use case.
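As a rough sketch of how that might be run on the markup from the question: `diagnose` lives in `bs4.diagnose` and prints its report to stdout. Capturing the output into the `report` variable is my own addition so the text can be searched or logged, and the exact wording of the report varies with the bs4 version and the parsers you have installed.

```python
import contextlib
import io

from bs4.diagnose import diagnose

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''

# diagnose() writes to stdout, so redirect stdout into a buffer
# to keep the report as a string.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diagnose(txt)
report = buffer.getvalue()

# Show what each installed parser made of the markup.
print(report)
```

Comparing the per-parser sections of the report shows which parsers keep the CDATA section and which drop or rewrite it.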

python

beautifulsoup

cdata

python-3.x