How to insert a parent node in XML through Python?

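I have an XML file with the structure below. I want to insert a "newline" tag every time there is a certain distance between the coordinates given by the bbox attribute (in this example the values are identical; in the real file they are all different):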

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

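However, my code does not work: when I print the tree, I find no trace of the newline element. It should wrap the text tags up to the next tag, for example: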

<newline><text></text></newline><newline><text></text></newline>

and so on.

The code is:

import xml.etree.ElementTree as ET
import lxml.etree as etree
tree = ET.parse("fe2.xml")
root = tree.getroot()
node = ET.Element('newline')


for child in root.iter():
    if child.tag == 'text':
        #print(child.tag, child.attrib.items())
        for name, value in child.attrib.items():
                if name == 'bbox':
                        value = tuple(value.split(","))
                        x1 = float(value[0])
                        x2 = float(value[2])
                        distance = x2 - x1
                        if distance > 10:
                                root.insert(3, node)
                                xml_str = ET.tostring(root, encoding='unicode')
                                print(xml_str)

How can I make this work?

To accomplish your task, it is easier to use lxml than ElementTree. So I used the following imports:

import lxml.etree as etree
from lxml.builder import E

The second import provides a factory for new elements.
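As a small illustration of that factory (a sketch, not part of the original solution): accessing an attribute on E gives a function that builds an element with that tag, and an element passed as an argument becomes its child. The sample text element below is made up for the demonstration.

```python
import lxml.etree as etree
from lxml.builder import E

# A hypothetical child element, built only for this demonstration
child = etree.Element('text', size='12.482')
child.text = 'C'

# E.newline(...) creates a <newline> element and appends "child" to it
wrapper = E.newline(child)

print(etree.tostring(wrapper, encoding='unicode'))
# <newline><text size="12.482">C</text></newline>
```

This is the call that wrap() relies on further down to create the wrapper around the first element of each run.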

To make the elements easier to identify, I slightly changed the values in the source file (the fractional part after 191.).

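To get properly indented pretty-printed output, I read the source file as follows (remove_blank_text=True drops the whitespace-only text between tags, which would otherwise keep lxml from re-indenting the tree):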

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('input.xml', parser)
root = tree.getroot()
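
To see what the parser option changes, here is a minimal sketch that serializes the same snippet with and without it. The inline SAMPLE string and the show helper are assumptions made for this demonstration only.

```python
import lxml.etree as etree

# Deliberately uneven indentation, so the difference is visible
SAMPLE = '<textline>\n        <text>C</text>\n  <text>A</text>\n</textline>'

def show(parser):
    """Parse SAMPLE with the given parser and pretty-print the result."""
    root = etree.fromstring(SAMPLE, parser)
    return etree.tostring(root, encoding='unicode', pretty_print=True)

# Default parser: the whitespace between tags is kept as text nodes
kept = show(etree.XMLParser())

# remove_blank_text=True: whitespace-only text is dropped while parsing,
# so the serializer is free to indent the elements itself
stripped = show(etree.XMLParser(remove_blank_text=True))

print(kept)
print(stripped)
```

With the default parser the original uneven indentation survives in the output; with remove_blank_text=True each text element comes out on its own line with a uniform two-space indent.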

To wrap each sequence of "good" elements (those whose bbox width is below 10) in a single newline element, I proceeded as follows:

  1. Define a function that removes an element from its parent by index and returns that element:

    def removeByIdx(parent, idx):
        currElem = parent[idx]   # The indicated element
        parent.remove(currElem)  # Remove it from the parent
        return currElem          # Return the removed element
    
  2. Define a function that wraps the children of line (the parent) with the given indices in a newline element:

    def wrap(line, idxList):
        if len(idxList) == 0:
            return    # No elements to wrap
        # Take the first element from the original location
        idx = idxList.pop(0)     # Index of the first element
        elem = removeByIdx(line, idx) # The indicated element
        # Create "newline" element with "elem" inside
        nElem = E.newline(elem)
        line.insert(idx, nElem)  # Put it in place of "elem"
        while len(idxList) > 0:  # Process the rest of index list
            # Value not used, but must be removed
            idxList.pop(0) 
            # Remove the current element from the original location
            currElem = removeByIdx(line, idx + 1)
            nElem.append(currElem)  # Append it to "newline"
    
  3. After reading the source XML tree, run:

    for line in root.iter('textline'):
        idxList = []
        for elem in line:
            bbox = elem.attrib.get('bbox')
            if bbox is not None:
                tbl = bbox.split(',')
                distance = float(tbl[2]) - float(tbl[0])
            else:
                distance = 100  # "Too big" value
            if distance < 10:
                par = elem.getparent()
                idx = par.index(elem)
                idxList.append(idx)
            else:  # "Wrong" element, wrap elements "gathered" so far
                wrap(line, idxList)
                idxList = []
        # Process "good" elements without any "bad" after them, if any
        wrap(line, idxList)
    

Then I printed the resulting tree:

print(etree.tostring(root, encoding='unicode', pretty_print=True))

and got:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
      <textline bbox="179.739,592.028,261.007,604.510">
        <newline>
          <text font="NUMPTY+ImprintMTnum" bbox="191.740,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
          <text font="NUMPTY+ImprintMTnum-it" bbox="191.741,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
          <text font="NUMPTY+ImprintMTnum-it" bbox="191.742,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
          <text font="NUMPTY+ImprintMTnum-it" bbox="191.743,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.744,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.746,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.747,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
        </newline>
        <text/>
        <newline>
          <text font="NUMPTY+ImprintMTnum" bbox="191.748,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.749,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
          <text font="NUMPTY+ImprintMTnum" bbox="191.750,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        </newline>
        <text/>
      </textline>
    </textbox>
  </page>
</pages>
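
For readers who would rather stay with the standard library, as in the question, the same grouping can be sketched with xml.etree.ElementTree and itertools.groupby. This is an alternative approach, not the code used above. The helper names (is_narrow, wrap_runs), the limit parameter and the shortened inline sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Shortened, hypothetical sample in the shape of the question's file
SAMPLE = """<textline>
  <text bbox="191.740,592.218,199.339,603.578">C</text>
  <text bbox="191.741,592.218,199.339,603.578">A</text>
  <text></text>
  <text bbox="191.748,592.218,199.339,603.578">I</text>
  <text></text>
</textline>"""

def is_narrow(elem, limit=10.0):
    """True if elem has a bbox whose width (x2 - x1) is below limit."""
    bbox = elem.get('bbox')
    if bbox is None:
        return False               # no bbox: treated as a separator
    x1, _, x2, _ = (float(v) for v in bbox.split(','))
    return x2 - x1 < limit

def wrap_runs(line):
    """Wrap each run of consecutive narrow children in one <newline>."""
    children = list(line)          # snapshot, since the parent is rebuilt
    for child in children:
        line.remove(child)
    for narrow, run in groupby(children, key=is_narrow):
        if narrow:
            wrapper = ET.SubElement(line, 'newline')
            wrapper.extend(list(run))
        else:
            line.extend(list(run))  # separators go back unchanged

root = ET.fromstring(SAMPLE)
# list(...) so the tree is not modified while iter() is still walking it
for line in list(root.iter('textline')):
    wrap_runs(line)

print([child.tag for child in root])
# ['newline', 'text', 'newline', 'text']
```

Rebuilding the child list from a snapshot avoids the index bookkeeping of wrap() and removeByIdx(), at the cost of detaching and re-attaching every child of each textline.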