How to efficiently extract <![CDATA[]]> content from XML with Python?

Question

I have the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>

What is the most efficient way to parse the <![CDATA[content]]> sections and extract them into a list? For example:

[@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING      Ugh     YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt      @username Shout out to me????       ]

This is what I tried:

from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out

This is the output:

[<document>"@username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING        </document>]

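The problem with this output is that I should not be getting the <document></document> tags. How can I remove the <document></document> tags and get all the elements of this XML in a list?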

Answer 1

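There are several things wrong here. (Asking questions about choosing a library is against the rules, so I am ignoring that part of the question.)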

You need to pass in a file handle, not a file name.

That is: y = BeautifulSoup(open(x))

You need to tell BeautifulSoup that it is dealing with XML.

That is: y = BeautifulSoup(open(x), 'xml')
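
CDATA sections do not create elements. You cannot search for them in the DOM, because they do not exist in the DOM; they are just syntactic sugar. Look at the text directly under document, and do not try to search for something named CDATA.
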
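To state this again, slightly differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What is different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However, you cannot tell from the parsed object tree whether your document contained a CDATA section with literal < and > or ordinary text with &lt; and &gt;. This is by design, and it holds for any compliant XML DOM implementation.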
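
A quick way to convince yourself of this is to parse both spellings and compare the results. The sketch below uses the standard-library xml.etree.ElementTree instead of BeautifulSoup, purely so that it runs without extra packages; that swap and the variable names are my own assumptions, not part of the question.

```python
import xml.etree.ElementTree as ET

# The same text content, written once as a CDATA section and once with entities.
with_cdata = ET.fromstring("<doc><![CDATA[<hello>]]></doc>")
with_entities = ET.fromstring("<doc>&lt;hello&gt;</doc>")

# After parsing, both elements simply hold a text string; no CDATA node is left.
print(repr(with_cdata.text))
print(repr(with_entities.text))

# The parsed tree gives no way to tell the two source spellings apart.
same_text = with_cdata.text == with_entities.text
print(same_text)
```

Both elements end up with the same text, which is why there is nothing called CDATA to search for once the document has been parsed.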

Now, how about some code that actually works:

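import bs4

# The XML declaration has to be the very first thing in the document,
# so the string starts immediately after the opening quotes.
doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That came at the wrong time ????" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""

# The 'xml' parser needs lxml to be installed.
doc_el = bs4.BeautifulSoup(doc, 'xml')

# el.text is the text content of each <document>; the tags and CDATA markers are gone.
# print(...) with a single argument works on both Python 2.7 and Python 3.
print([el.text for el in doc_el.find_all('document')])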

If you want to read from a file, replace doc with open(filename, 'r').
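
For the file case, here is a minimal sketch of the same extraction. It assumes Python 3 and again uses the standard-library xml.etree.ElementTree rather than BeautifulSoup, so it is an alternative and not the code above. The helper name extract_documents and the temporary sample file are hypothetical, made up only to keep the example self-contained. It also strips the trailing whitespace that the CDATA sections in the question carry, which you may or may not want.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def extract_documents(path):
    """Return the stripped text of every <document> element in the file at path."""
    tree = ET.parse(path)  # ET.parse accepts a file name or an open file object
    # el.text can be None for an empty element, hence the "or ''" fallback.
    return [(el.text or "").strip() for el in tree.getroot().iter("document")]


# Hypothetical sample file, written only so that the sketch can run on its own.
sample = (
    '<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">\n'
    '    <document><![CDATA[Ugh      ]]></document>\n'
    '    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>\n'
    '</author>\n'
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as fh:
    fh.write(sample)
    sample_path = fh.name

texts = extract_documents(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(texts)
```

Drop the .strip() call if the exact original spacing matters to you.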

python

xml

lxml

python-2.7

pandas