XML: delete unwanted tags but keep text content
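I'm trying to clean up a corpus that has too many tags. To do that, I want to filter out/remove the useless tags while keeping their text content. I'm fairly new to working with XML, and none of the code I've tried has worked. The corpus looks like this: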
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you </sentence>
<sentence tag1="ff" tag2= "e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2= "bbb"> Great </sentence>
<sentence tag1="f" tag2= "dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>
The ideal result would be:
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you today </sentence>
</dialogue>
<dialogue speaker="A">
Great How about you
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>
The first code I tried is this, but it keeps giving me an error at strip_tags():
f = ET.parse("file.xml")
root = f.getroot()
def filter_by(f, tag_list):
    for elem in root.iter('dialogue'):
        for start in elem.iter('sentence'):
            print(sentence.attrib)
            if tag_list in root.findall('.//sentence[@tag1]'):
                pass
            else:
                etree.strip_tags(f, 'sentence')
    return f
filter_by(f, ["a"])
f.write("output.xml")
Since there is more than one tag I need to keep, I also tried this other option, but it still gives me an error in the if statement:
f = ET.parse("file.xml")
root = f.getroot()
tags_want = ["a", "cc"]
for child in root.iter('sentence'):
    attrib = child.get("tag1")
    if attrib not in tags_want:
        etree.strip_tags(f,'sentence')
f.write("output.xml")
Can anyone help me?
I would do it in one of these two ways. First, using ElementTree and XPath, as you did:
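import xml.etree.ElementTree as ET

root = ET.parse('file.xml').getroot()
for dia in root.findall('.//dialogue'):
    sentences = dia.findall('./sentence')
    if len(sentences) > 1:
        # Merge the stripped text of all sentences into the first one
        sentences[0].text = " ".join((s.text or "").strip() for s in sentences)
        # Detach the remaining sentences; clear() would leave empty elements behind
        for to_delete in sentences[1:]:
            dia.remove(to_delete)
print(ET.tostring(root).decode())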
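When discarding the merged sentences, it matters whether an element is emptied or actually detached from its parent. A minimal sketch comparing `Element.clear()` with `remove()` on the parent, using a hypothetical two-sentence dialogue modeled on the corpus in the question (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modeled on one <dialogue> from the question.
xml_text = (
    '<dialogue speaker="B">'
    '<sentence tag1="cc">How are you</sentence>'
    '<sentence tag1="ff">today</sentence>'
    '</dialogue>'
)

# clear() resets the element (text, tail, attributes, children)
# but the element itself stays in the tree.
cleared = ET.fromstring(xml_text)
cleared.findall("./sentence")[1].clear()
cleared_count = len(cleared.findall("./sentence"))

# remove() on the parent detaches the child element entirely.
removed = ET.fromstring(xml_text)
removed.remove(removed.findall("./sentence")[1])
removed_count = len(removed.findall("./sentence"))

print(cleared_count, removed_count)
print(ET.tostring(cleared).decode())
print(ET.tostring(removed).decode())
```

If the goal is an output with a single sentence per dialogue, detaching the extra elements is probably the variant to use.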
Second, although it probably won't make much difference for your sample XML, I would use lxml rather than ElementTree, because of the former's better XPath support:
from lxml import etree

root = etree.parse('file.xml')
for dia in root.xpath('//dialogue'):
    if dia.xpath('count(./sentence)') > 1:
        # Join the stripped text of every sentence with single spaces
        texts = [t.strip() for t in dia.xpath('.//sentence//text()')]
        new_text = " ".join(t for t in texts if t)
        dia.xpath('.//sentence')[0].text = new_text
        # Remove every sentence after the first from its parent
        for to_delete in dia.xpath('.//sentence[position()>1]'):
            to_delete.getparent().remove(to_delete)
print(etree.tostring(root).decode())
In either case, the output should be:
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd">How are you today</sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb">Great How about you</sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2="dd"> me too </sentence>
</dialogue>
</corpus>
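This output keeps one sentence element per dialogue, whereas the ideal result in the question drops the sentence tags entirely when `tag1` is not in the wanted list. A minimal sketch of that selective unwrapping with the standard library only follows. The helper `unwrap_unwanted` is a hypothetical name, not part of ElementTree, and it assumes the unwanted elements contain only text (any child elements would be lost). Note that the text of an unwrapped element ends up in the parent, after the preceding sibling, rather than inside that sibling.

```python
import xml.etree.ElementTree as ET

def unwrap_unwanted(parent, tag, attr, wanted):
    """Remove `tag` children whose `attr` is not in `wanted`, keeping their text.

    Hypothetical helper; assumes the removed elements have no child elements.
    """
    previous = None  # last child that was kept so far
    for child in list(parent):  # iterate over a copy while removing
        if child.tag == tag and child.get(attr) not in wanted:
            # Text that must survive: the element's own text plus its tail.
            kept = (child.text or "") + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + kept
            else:
                previous.tail = (previous.tail or "") + kept
            parent.remove(child)
        else:
            previous = child

# Shortened version of the corpus from the question.
sample = """<corpus>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd"> How are you </sentence>
<sentence tag1="ff" tag2="e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb"> Great </sentence>
<sentence tag1="f" tag2="dd"> How about you </sentence>
</dialogue>
</corpus>"""

root = ET.fromstring(sample)
tags_want = ["a", "cc"]  # same whitelist as in the question
for dia in root.findall(".//dialogue"):
    unwrap_unwanted(dia, "sentence", "tag1", tags_want)

result = ET.tostring(root, encoding="unicode")
print(result)
```

To apply it to a file, `ET.parse("file.xml")` could replace `ET.fromstring(sample)`, with the tree written back via `write("output.xml")` as in the question.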