How to get inner content as string using minidom from xml.dom?
My XML file contains some text tags (the PDF was converted to XML with pdftohtml from poppler-utils), like this:
<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>
I can get the text contained in a text tag with this sample code:
from xml.dom import minidom
xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')
some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue
# but not if <i></i> wraps only part of the string
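But I can't get "nodeValue" if the element contains another tag (<i>, <b>, ...), and I can't get the object either.
What is the best way to get all of the text as a plain string (like JavaScript's innerHTML), or to recurse into the child tags, even when they wrap only some of the words rather than the whole nodeValue?
Thanks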
Question: How to get inner content as string using minidom
Here is a recursive solution, for example:
from xml.dom import minidom

def getText(nodelist):
    # Walk all nodes and collect the data of every TEXT_NODE
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Recurse into the children of any other node
            rc.append(getText(node.childNodes))
    return ''.join(rc)

xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')
# Iterate over the <text ...>...</text> node list
for node in nodelist:
    print(getText(node.childNodes))
Output:
..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...
Tested with Python 3.4.2
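For anyone who wants to try the recursive idea without a file on disk, here is a minimal self-contained sketch. It assumes the sample line from the question wrapped in a hypothetical `<page>` root so that it parses from a string, and the helper name `inner_text` is made up for illustration. Unlike the version above, it takes the element itself and also picks up CDATA sections, which is a design choice rather than something the question asked for.

```python
from xml.dom import minidom

# Sample data from the question, wrapped in a hypothetical <page> root
# so that it parses as a single well-formed document.
SAMPLE = (
    '<page>'
    '<text top="546" left="128" width="645" height="16" font="1">'
    "with many many pages and some <i>italics text among 'plain' text</i>"
    ' and more and more text</text>'
    '</page>'
)

def inner_text(node):
    """Concatenate the data of every text node below `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            # Plain text or CDATA: keep the character data as-is
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Nested tag such as <i> or <b>: recurse into it
            parts.append(inner_text(child))
    return ''.join(parts)

doc = minidom.parseString(SAMPLE)
text_elem = doc.getElementsByTagName('text')[0]
result = inner_text(text_elem)
print(result)
```

Comments and processing instructions are skipped here, since only text, CDATA, and element nodes are handled.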
Late to the party... I had a similar problem, except that I wanted the tags in the resulting string. Here is my solution:
from xml.dom import minidom

# Reconstruct this element's body XML from its DOM nodes
def getChildXML(elem):
    out = ""
    for c in elem.childNodes:
        if c.nodeType == minidom.Node.TEXT_NODE:
            out += c.nodeValue
        elif c.nodeType == minidom.Node.ELEMENT_NODE:
            if c.childNodes.length == 0:
                # Empty element: emit a self-closing tag
                out += "<" + c.nodeName + "/>"
            else:
                out += "<" + c.nodeName + ">"
                out += getChildXML(c)
                out += "</" + c.nodeName + ">"
    return out
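This should return the inner XML including the tags, although attributes on child elements are not written back and text is not re-escaped.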
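If the child elements carry attributes, or the text contains characters such as & that need escaping, one alternative is to let minidom serialize each child node itself. A minimal sketch, assuming the standard `toxml()` method available on minidom nodes; the helper name `inner_xml` and the sample string are hypothetical:

```python
from xml.dom import minidom

def inner_xml(elem):
    """Serialize each child of `elem` with toxml() and join the pieces."""
    return ''.join(child.toxml() for child in elem.childNodes)

# Hypothetical sample: a <text> element whose child carries an attribute
# and whose text contains an escaped ampersand.
doc = minidom.parseString(
    '<text font="1">some <i class="x">italic &amp; plain</i> text<br/></text>'
)
result = inner_xml(doc.documentElement)
print(result)
```

The outer `<text>` tag and its attributes are left out because only the children are serialized, which is roughly what innerHTML does in JavaScript.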