排序 XML 基于属性保留 Python 中每个父节点的所有子节点

Sorting XML based on attributes retaining all children nodes for each parent node in Python

我有一个 xml 文件,我想根据属性值对其进行排序。以下是 xml 文件:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>

所需的输出是这样的:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
 </images>
</dataset>

我尝试了以下两个选项:

import xml.etree.ElementTree as ET
tree = ET.parse("finalxml.xml")
container = tree.find("images")
data = []
for elem in container:
    key = elem.findtext("image")
    data.append((key,elem))
data.sort()
container[:] = [item[-1] for item in data]
tree.write("new-data.xml")

此代码只是重新对齐框属性,而不是图像文件属性,这是不可取的。以下是我从 SO 中获取的内容,但没有做任何事情。

# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET

def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]):  # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items):  # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))

ET._serialize_xml = _serialize_xml

from collections import OrderedDict

class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)


tree = ET.parse("example1.xml", OrderedXMLTreeBuilder())
tree.write("new-data.xml")

如何对 xml 进行排序?

使用 list.sortkey 命名参数来使用每个 <image> 标签的 file 属性作为排序的键:

key specifies a function of one argument that is used to extract a comparison key from each list element (for example, key=str.lower). The key corresponding to each item in the list is calculated once and then used for the entire sorting process. The default value of None means that list items are sorted directly without calculating a separate key value.

import xml.etree.ElementTree

xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>'''

root = xml.etree.ElementTree.fromstring(xml_string)
images_root = root.find('images')
images = images_root.findall('image')
images.sort(key = lambda x: x.attrib['file'])
images_root[:] = images

print(xml.etree.ElementTree.tostring(root))

基于 this answer 使用 lxml 的替代解决方案指出 lxml 按属性设置的顺序序列化属性(与 xml 不同):

import lxml.etree

xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <name>imglab dataset</name>
  <comment>Created by imglab tool.</comment>
  <images>
    <text>lol</text>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
      <box top="175" left="59" width="73" height="29">
        <label>groundpainting_hotstar</label>
      </box>
      <box top="174" left="205" width="56" height="24">
        <label>groundpainting_yesbank</label>
      </box>
      <box top="170" left="141" width="44" height="32">
        <label>groundpainting_vodafone</label>
      </box>
    </image>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
    <image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
      <box top="198" left="17" width="32" height="10">
        <label>sightscreen_pepsi</label>
      </box>
    </image>
 </images>
</dataset>'''

root = lxml.etree.fromstring(xml_string)
images_root = root.find('images')
images = images_root.findall('image')
images.sort(key = lambda x: x.attrib['file'])
images_root[:] = images

print(lxml.etree.tostring(root))

注意:这将删除 <images> 的任何不是 <image> 的子代(直系后代)。