How to join elements with the same tag within the same parent in Python XML, despite their attributes?
I have an XML structure like this:
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">A</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">P</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">O</text>
<text> </text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
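I want to merge all the text tags that have the same text size within the same parent (textline), so that the individual letters are joined. The pages, page and textbox tags should be kept. I also want to preserve the order in which the letters appear, like this: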
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">CAPITOLO III</text>
</textline>
</textbox>
</page>
</pages>
I tried searching the internet, but without success. This is what I tried:
import xml.etree.ElementTree as ET
MY_XML = ET.parse('fe.xml')
group_list = MY_XML.findall("./pages/page/textbox/textline")  # I do this because the actual xml is bigger with several groups
text_list = []
for group in group_list:
    string_text = ""
    for child in group:
        for super_child in child:
            if(super_child.text is not None):  # Just in case None value because I cannot use string addition
                string_text = string_text + super_child.text + " "
    text_list.append(string_text)
# I stored all the info in 1 group as a value in this list because like I stated my overall xml might be bigger with more than 1 group
for group in group_list:
    for elem in group.findall("./pages/page/textbox/textline/text"):
        # loop over all possible <group> and removes all <group_info> inside
        group.remove(elem)
# And finally to append the information gathered:
for group in group_list:
    Text_elem = ET.Element("text")
    Text_elem.text = text_list[group_list.index(group)]
    group.append(Text_elem)
print(group_list)
I don't know how to get this working. Any pointers would be appreciated.
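One problem in the code is that MY_XML.findall("./pages/page/textbox/textline") returns an empty list. The root element is pages, and it is the context for findall(), so the path must not name pages again. findall("./page/textbox/textline") will work.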
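To see the difference between the two paths, here is a small sketch. It uses a cut-down inline copy of the XML (my own abbreviation, not the full file) and ET.fromstring() instead of reading fe.xml, so it can be run on its own. The ".//textline" form is an extra option that searches at any depth.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for fe.xml (assumption: same nesting as in the question).
SAMPLE = """<pages>
  <page id="1">
    <textbox id="0">
      <textline>
        <text size="12.482">C</text>
      </textline>
    </textbox>
  </page>
</pages>"""

root = ET.fromstring(SAMPLE)  # root is the <pages> element

# The path is evaluated relative to <pages>, so naming "pages" again
# looks for a <pages> child of <pages>, which does not exist.
wrong = root.findall("./pages/page/textbox/textline")

# Starting from the children of the root matches as intended.
right = root.findall("./page/textbox/textline")

# ".//" searches all descendants, whatever the nesting depth.
anywhere = root.findall(".//textline")

print(len(wrong), len(right), len(anywhere))
```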
Here is a program that produces the wanted output:
import xml.etree.ElementTree as ET
MY_XML = ET.parse('fe.xml')
textlines = MY_XML.findall("./page/textbox/textline")
for textline in textlines:
    fulltext = []
    # Iterate over a copy, because children are removed inside the loop
    for text_elem in list(textline):
        # Get the text of each 'text' element and then remove it
        # (an empty element has text None, so fall back to "")
        fulltext.append(text_elem.text or "")
        textline.remove(text_elem)
    # Create a new 'text' element and add the joined letters to it
    new_text_elem = ET.Element("text", font="NUMPTY+ImprintMTnum", ncolour="0", size="12.482")
    new_text_elem.text = "".join(fulltext).strip()
    # Append the new 'text' element to its parent
    textline.append(new_text_elem)
print(ET.tostring(MY_XML.getroot(), encoding="unicode"))
Output:
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">CAPITOLO III</text></textline>
</textbox>
</page>
</pages>
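The program above hard-codes the font, ncolour and size attributes. If the real file mixes several sizes, something along these lines might be closer to "merge the text tags with the same size". This is only a sketch under my own assumptions: merge_text_runs is a hypothetical helper name, a textline is assumed to contain only text children, and a text element without a size (such as the space or the newline) is attached to the run before it. Because each run is stripped, whitespace that ends one run is dropped.

```python
import xml.etree.ElementTree as ET


def merge_text_runs(textline):
    """Merge consecutive <text> children that share a size into one element."""
    runs = []  # each entry: [attribute dict, list of text pieces]
    for child in list(textline):  # copy, because children are removed below
        if child.tag != "text":
            continue
        piece = child.text or ""
        size = child.get("size")
        if runs and (size is None or runs[-1][0].get("size") == size):
            # Same size as the current run, or an unstyled element
            # (space/newline): extend the current run.
            runs[-1][1].append(piece)
        else:
            # A new size starts a new run and keeps that element's attributes.
            runs.append([dict(child.attrib), [piece]])
        textline.remove(child)
    for attrib, pieces in runs:
        merged_text = "".join(pieces).strip()
        if merged_text:  # skip runs that held only whitespace
            merged = ET.SubElement(textline, "text", attrib)
            merged.text = merged_text


# Abbreviated sample with the same shape as the XML in the question.
SAMPLE = """<pages><page id="1"><textbox id="0"><textline>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">A</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">P</text>
<text> </text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" ncolour="0" size="12.482">I</text>
<text>
</text>
</textline></textbox></page></pages>"""

root = ET.fromstring(SAMPLE)
for textline in root.findall("./page/textbox/textline"):
    merge_text_runs(textline)

merged_elems = root.findall("./page/textbox/textline/text")
print([(e.get("size"), e.text) for e in merged_elems])
```

With the sample from the question there is a single size, so each textline ends up with one text element that carries the attributes of its first letter.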