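How to insert a parent node in XML through Python?
I have an XML file with the structure below, and I want to insert a "newline" tag whenever the coordinates show a certain distance (the values here are just an example; in the real file they all differ):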
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text></text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text></text>
</textline>
</textbox>
</page>
</pages>
However, my code does not work: when I print the tree, I find no trace of newline. It should wrap the text tags up to the next break, for example:
<newline><text></text></newline><newline><text></text></newline>
and so on.
The code is:
import xml.etree.ElementTree as ET
import lxml.etree as etree

tree = ET.parse("fe2.xml")
root = tree.getroot()
node = ET.Element('newline')
for child in root.iter():
    if child.tag == 'text':
        #print(child.tag, child.attrib.items())
        for name, value in child.attrib.items():
            if name == 'bbox':
                value = tuple(value.split(","))
                x1 = float(value[0])
                x2 = float(value[2])
                distance = x2 - x1
                if distance > 10:
                    root.insert(3, node)
xml_str = ET.tostring(root, encoding='unicode')
print(xml_str)
How can I make this work?
For this task, lxml is easier to use than ElementTree.
So I used the following imports:
import lxml.etree as etree
from lxml.builder import E
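The second import provides a factory for new elements.
To make the elements easier to tell apart, I slightly changed the values
in the source file (the fractional part after 191.).
So that pretty-printing works later, I read the source file with a parser that drops whitespace-only text: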
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('input.xml', parser)
root = tree.getroot()
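The reason for remove_blank_text=True can be seen on a tiny document. The sketch below assumes lxml is installed; the two-element sample string is made up for illustration and is not the file from the question. When whitespace-only text nodes are kept, pretty_print leaves the existing layout alone; when the parser drops them, the serializer is free to indent.

```python
import lxml.etree as etree

# Hypothetical sample: a child element surrounded by whitespace-only text.
xml_text = '<a>\n<b/>\n</a>'

# Default parser: the newlines are kept as text/tail of the elements.
raw_root = etree.fromstring(xml_text)
raw = etree.tostring(raw_root, encoding='unicode', pretty_print=True)

# Parser that discards whitespace-only text nodes.
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.fromstring(xml_text, parser)
clean = etree.tostring(clean_root, encoding='unicode', pretty_print=True)

print(repr(raw))
print(repr(clean))
```

This matters here because the wrapped elements are moved into freshly created newline elements, and the output is only readable if the serializer can re-indent the whole tree.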
To wrap each sequence of "good" elements in a single newline element,
proceed as follows:
Define a function that removes an element from its parent by index
and returns that element:
def removeByIdx(parent, idx):
    currElem = parent[idx]    # The indicated element
    parent.remove(currElem)   # Remove it from the parent
    return currElem           # Return the removed element
Define a function that wraps the children of line (the parent) at the given indices
in a newline element:
def wrap(line, idxList):
    if len(idxList) == 0:
        return  # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)           # Index of the first element
    elem = removeByIdx(line, idx)  # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)        # Put it in place of "elem"
    while len(idxList) > 0:        # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)     # Append it to "newline"
After reading the source XML tree, run:
for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance < 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)
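If installing lxml is not an option, the same grouping idea can probably be expressed with the standard library alone. The following is only a sketch under assumptions: the helper names bbox_width, is_good and wrap_runs and the shortened sample string are hypothetical, and itertools.groupby replaces the index bookkeeping used above. Elements without a bbox count as "bad", as in the loop above.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def bbox_width(elem):
    """Return x2 - x1 from the bbox attribute, or None if it is missing."""
    bbox = elem.get('bbox')
    if bbox is None:
        return None
    x1, _y1, x2, _y2 = (float(v) for v in bbox.split(','))
    return x2 - x1

def is_good(elem, limit=10.0):
    """An element is "good" when it has a bbox narrower than limit."""
    width = bbox_width(elem)
    return width is not None and width < limit

def wrap_runs(line, limit=10.0):
    """Wrap each run of consecutive "good" children of line in <newline>."""
    children = list(line)        # Snapshot before modifying the parent
    for child in children:
        line.remove(child)       # Detach everything, then re-attach in order
    for good, run in groupby(children, key=lambda e: is_good(e, limit)):
        if good:
            wrapper = ET.SubElement(line, 'newline')
            wrapper.extend(list(run))
        else:
            line.extend(list(run))

# Hypothetical, shortened sample in the shape of the question's file.
sample = (
    '<textline>'
    '<text bbox="191.745,592.218,199.339,603.578">C</text>'
    '<text bbox="191.745,592.218,199.339,603.578">A</text>'
    '<text></text>'
    '<text bbox="191.745,592.218,199.339,603.578">I</text>'
    '<text></text>'
    '</textline>'
)
line = ET.fromstring(sample)
wrap_runs(line)
print([child.tag for child in line])
```

For a whole file, wrap_runs would be called once per element returned by root.iter('textline'), after collecting those elements into a list so the tree is not modified while it is being iterated.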
Then I printed the resulting tree:
print(etree.tostring(root, encoding='unicode', pretty_print=True))
and got:
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<newline>
<text font="NUMPTY+ImprintMTnum" bbox="191.740,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.741,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.742,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.743,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.744,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.746,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.747,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
</newline>
<text/>
<newline>
<text font="NUMPTY+ImprintMTnum" bbox="191.748,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.749,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.750,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
</newline>
<text/>
</textline>
</textbox>
</page>
</pages>
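As a side note on why the code in the question prints no newline at all: with the sample values, every bbox is 199.339 - 191.745 wide, which is below 10, so the condition distance > 10 never fires. Even if it did, root.insert(3, node) would only add one empty newline element under pages instead of wrapping any text elements. A minimal check of the arithmetic, using a hypothetical helper name bbox_width:

```python
def bbox_width(bbox):
    """Return x2 - x1 for a bbox string of the form "x1,y1,x2,y2"."""
    x1, _y1, x2, _y2 = (float(v) for v in bbox.split(','))
    return x2 - x1

# bbox value copied from the text elements in the question
width = bbox_width("191.745,592.218,199.339,603.578")
print(round(width, 3))
print(width > 10)   # The test used in the question
print(width < 10)   # The test used in the answer
```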