Sorting XML based on attributes retaining all children nodes for each parent node in Python
I have an XML file that I want to sort by an attribute value. Here is the XML file:
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>
The desired output looks like this:
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
</images>
</dataset>
I tried the following two options:
import xml.etree.ElementTree as ET
tree = ET.parse("finalxml.xml")
container = tree.find("images")
data = []
for elem in container:
    key = elem.findtext("image")
    data.append((key,elem))
data.sort()
container[:] = [item[-1] for item in data]
tree.write("new-data.xml")
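This code only rearranges the box attributes, not the image file attributes, which is not what I want. Below is what I picked up from SO, but it didn't do anything.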
# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET
def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]): # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items): # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))
ET._serialize_xml = _serialize_xml
from collections import OrderedDict
class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)
tree = ET.parse("example1.xml", OrderedXMLTreeBuilder())
tree.write("new-data.xml")
How do I go about sorting the XML?
Use the key named argument of list.sort so that the file attribute of each <image> tag is used as the sort key:
key specifies a function of one argument that is used to extract a comparison key from each list element (for example, key=str.lower). The key corresponding to each item in the list is calculated once and then used for the entire sorting process. The default value of None means that list items are sorted directly without calculating a separate key value.
import xml.etree.ElementTree
xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>'''
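root = xml.etree.ElementTree.fromstring(xml_string)
images_root = root.find('images')
# Collect only the <image> children and sort them by their file attribute.
images = images_root.findall('image')
images.sort(key=lambda x: x.attrib['file'])
# Slice assignment replaces every child of <images> with the sorted list.
images_root[:] = images
# On Python 3, tostring() returns bytes unless encoding='unicode' is passed.
print(xml.etree.ElementTree.tostring(root))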
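Since the question works with files rather than a string, here is a minimal sketch of the same idea applied to a file on disk. The function name sort_images_by_file and its two path parameters are my own assumptions for illustration. It also shows why the first attempt in the question did not sort by file: findtext("image") looks for a child element named image and returns its text, whereas the file name lives in an attribute and has to be read with get("file").

```python
import xml.etree.ElementTree as ET


def sort_images_by_file(src_path, dst_path):
    """Sort the <image> children of <images> by their file attribute.

    src_path and dst_path are hypothetical parameter names; the element
    and attribute names come from the XML shown in the question.
    """
    tree = ET.parse(src_path)
    images_root = tree.getroot().find("images")
    # get("file") reads the attribute value. Whole elements are reordered,
    # so every <box> and <label> stays inside its own <image>.
    ordered = sorted(images_root.findall("image"),
                     key=lambda elem: elem.get("file", ""))
    # Like the snippet above, this keeps only the <image> children.
    images_root[:] = ordered
    tree.write(dst_path)
```

It could be called as sort_images_by_file("finalxml.xml", "new-data.xml"). As far as I know, the standard-library parser does not keep the leading xml-stylesheet processing instruction, so that line would have to be written back separately if it is needed in the output.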
Alternative solution using lxml, based on this answer, which notes that lxml serializes attributes in the order they were set (unlike xml):
import lxml.etree
xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<text>lol</text>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>'''
root = lxml.etree.fromstring(xml_string)
images_root = root.find('images')
images = images_root.findall('image')
images.sort(key = lambda x: x.attrib['file'])
images_root[:] = images
print(lxml.etree.tostring(root))
Note: This will remove any children (direct descendants) of <images> that are not <image>.
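If those other children need to survive, one possible approach is to reorder only the positions that <image> elements already occupy and leave everything else where it is. The following is a sketch under that assumption, using the standard-library ElementTree; sort_children_in_place is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def sort_children_in_place(parent, tag="image", attr="file"):
    """Sort only the children named `tag`, keeping other children in place."""
    # Positions currently occupied by the elements we want to sort.
    slots = [i for i, child in enumerate(parent) if child.tag == tag]
    # sorted() builds the full list before any slot is overwritten.
    ordered = sorted((parent[i] for i in slots),
                     key=lambda elem: elem.get(attr, ""))
    # Put the sorted elements back into the same positions.
    for i, elem in zip(slots, ordered):
        parent[i] = elem


root = ET.fromstring(
    '<images><text>lol</text><image file="b.jpg"/><image file="a.jpg"/></images>'
)
sort_children_in_place(root)
print([(child.tag, child.get("file")) for child in root])
# [('text', None), ('image', 'a.jpg'), ('image', 'b.jpg')]
```

This sketch targets the standard-library ElementTree only. lxml moves an element when it is assigned to a new position, so the same loop may not behave identically there.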