Reading CDATA with lxml, problem with end of line

Question

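Hello, I am parsing an XML document that contains a lot of CDATA sections. Until now I had not run into any problems. I have now realized that when I read an element and get its text attribute, I get end-of-line characters at the beginning and at the end of the text.

An important piece of the code is the following: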

for comments in self.xml.iter("Comments"):
    for comment in comments.iter("Comment"):
        description = comment.get('Description')

        if language == "Arab":
            tag = self.name + description
            text = comment.text

The problem is with the Comment element, which is written like this:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

When I try to get the text attribute, this is the result:

\nUsually made it with not reason\n

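I know I can call strip and so on, but I would like to fix the problem at its root. Maybe there is an option I can set before parsing with ElementTree.

When I parse the XML file, I do it like this: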

tree = ET.parse(xml)

Minimal reproducible example

import xml.etree.ElementTree as ET

filename = "test.xml"  # Put the path to your test XML file here

tree = ET.parse(filename)
root = tree.getroot()
Description = root  # <Description> is the root element of the minimal file below
text = Description.text

print(text)

Minimal XML file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>

Answer 1

You get newlines because there are newlines:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

Why else would <![CDATA and </Comment start on a new line?

If you don't want the newlines, remove them:

<Comment><![CDATA[Usually made it with not reason]]></Comment>

Everything inside an element counts toward its string value.
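To check this, one can parse both layouts from strings and compare the resulting `.text` values. The following is a minimal sketch using the standard-library `xml.etree.ElementTree`; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The same CDATA content, written with and without surrounding line breaks.
multi_line = "<Comment>\n<![CDATA[Usually made it with not reason]]>\n</Comment>"
single_line = "<Comment><![CDATA[Usually made it with not reason]]></Comment>"

text_multi = ET.fromstring(multi_line).text
text_single = ET.fromstring(single_line).text

print(repr(text_multi))   # the newlines around the CDATA section are kept
print(repr(text_single))  # only the CDATA content remains
```

The parser is not adding anything here: the newlines in the first result are exactly the ones that sit between the tags and the CDATA section in the source.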

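<![CDATA[...]]> is not an element, it is a parser flag. It changes the way the XML parser reads the enclosed characters. You can have multiple CDATA sections in the same element, switching between "regular mode" and "CDATA mode" at will: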

<Comment>normal text <![CDATA[
    CDATA mode, this may contain <unescaped> Characters!
]]> now normal text again
<![CDATA[more special text]]> now normal text again
</Comment>

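Any newlines before and after a CDATA section count toward the "regular text" part. When the parser reads this, it creates one long string made up of the individual parts: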

normal text 
    CDATA mode, this may contain <unescaped> Characters!
 now normal text again
more special text now normal text again
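The concatenation can be reproduced with the standard-library `xml.etree.ElementTree`. This is only a sketch to illustrate the point; `xml_source` is the mixed example from above written as a Python string.

```python
import xml.etree.ElementTree as ET

# Regular text and two CDATA sections mixed inside a single element.
xml_source = (
    "<Comment>normal text <![CDATA[\n"
    "    CDATA mode, this may contain <unescaped> Characters!\n"
    "]]> now normal text again\n"
    "<![CDATA[more special text]]> now normal text again\n"
    "</Comment>"
)

comment = ET.fromstring(xml_source)
combined = comment.text  # one string; the CDATA markers themselves are gone

print(combined)
print(len(comment))  # number of child elements: CDATA sections do not create any
```

The element has no children, and `<unescaped>` ends up as literal characters in the text instead of being parsed as a tag.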

Answer 2

I think that when CDATA appears in XML, it usually comes with a line ending before and after it, like this.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>

But you can also write it like this.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description><![CDATA[Hello world]]></Description>

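That is why we get the end-of-line characters when parsing with the ElementTree library. It works fine in both cases; whether you strip or not depends on how you want to process the data.

If you want to remove both '\n' characters, just add the following code: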

text = Description.text
text = text.strip('\n')
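If the stripping is needed in several places, it can be wrapped in a small helper. The sketch below is one possible version; `clean_text` is a hypothetical name, not part of ElementTree. It also guards against `element.text` being `None`, which is what ElementTree gives for an empty element, and it calls `strip()` without arguments so that spaces and tabs are removed as well (keep `strip('\n')` if only newlines should go).

```python
import xml.etree.ElementTree as ET


def clean_text(element):
    """Return the element's text without surrounding whitespace ('' if empty)."""
    # element.text is None for an element with no text, so fall back to "".
    return (element.text or "").strip()


root = ET.fromstring("<Description>\n<![CDATA[Hello world]]>\n</Description>")
cleaned = clean_text(root)
empty = clean_text(ET.fromstring("<Description/>"))

print(repr(cleaned))
print(repr(empty))
```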

elementtree

xml-parsing

python-3.x