Retain the namespace prefix in a tag when parsing XML with lxml
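I have an XML file like the one below. A few of its tags carry the prefix ce, for example <ce:title>. When I run the code below with xpath, <ce:title> is replaced with <title> in the output. I did see other links on SO, such as "How to preserve namespace information when parsing HTML with lxml?", but I am not sure where and how to add the namespace details.
Can anyone advise? How can I preserve <ce:title> for the XML below?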
from lxml import html
from lxml.etree import tostring

with open('102277033304.xml', encoding='utf-8') as file_object:
    xml = file_object.read().strip()
    root = html.fromstring(xml)
    for element in root.xpath('//item/book/pages/*'):
        html = tostring(element, encoding='utf-8')
        print(html)
XML:
<item>
  <book>
    <pages>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 1</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 2</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 3</page-fulltext>
    </pages>
  </book>
</item>
This is probably because you are using the HTML parser to read XML.
Try it this way:
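from lxml import etree

# Parse with the XML parser (not lxml.html) so prefixed tags are kept.
# `xml` is the string read from the file in the question's code.
root = etree.XML(xml)
for element in root.xpath('//item/book/pages/*'):
    serialized = etree.tostring(element, encoding='utf-8')
    print(serialized)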
This should give you the expected output, as long as the ce prefix is declared somewhere in the document.
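For reference, here is a minimal, self-contained sketch of the same idea. The excerpt in the question does not show where the ce prefix is declared, and lxml's XML parser normally rejects a prefix that is not bound to a namespace, so this sketch assumes a declaration on the root element. The URI http://example.com/ce, the SAMPLE data, and the serialize_pages helper are placeholders made up for illustration, not taken from the original file.

```python
from lxml import etree

# Assumption: the real file binds the "ce" prefix to some namespace URI.
# "http://example.com/ce" is a placeholder used only for this sketch.
SAMPLE = b"""<item xmlns:ce="http://example.com/ce">
  <book>
    <pages>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 1</page-fulltext>
    </pages>
  </book>
</item>"""


def serialize_pages(xml_bytes):
    """Parse with the XML parser and serialize each child of <pages>."""
    root = etree.fromstring(xml_bytes)
    return [
        # with_tail=False drops the whitespace that follows each element
        etree.tostring(element, encoding="unicode", with_tail=False)
        for element in root.xpath("//item/book/pages/*")
    ]


if __name__ == "__main__":
    for chunk in serialize_pages(SAMPLE):
        print(chunk)
```

Each serialized fragment keeps the <ce:title> tag. Because the fragments are cut out of a larger tree, lxml also writes the in-scope xmlns:ce declaration on the outermost element of each fragment so that it stays well-formed on its own.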