How to delete parts of XML data and write it to a new file with Python
I have the data structure shown below. The input file is very large, so I am trying to find an efficient way to do this.
<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>
Given an input file that lists several names, for example
1
3
it should delete the segments whose name attribute matches one of them. For example, given 1 and 3, the segments named 1 and 3 are deleted:
<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
</recording>
</corpus>
My current code:
from lxml import etree
with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()
with open('del_names.txt', 'r') as file:
    list_of_names = file.read().split("\n")
new_xml = xml_data
for each_name in list_of_names:
    print(each_name)
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(new_xml)
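The problem is that I have been running this code for 2 hours now and it has not output a single line. I am not sure what efficient approach I could use here.
How can I do this? I also suspect that the 2 for loops may be unnecessary, is that right?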
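If your code takes longer than expected, you can always start by adding a few print statements to get a better idea of where the time is being spent.
For your task, one loop is enough. Iterate over all 'segment' elements in the XML file, and delete a segment when its name is listed in the del_names.txt file.
To make the name lookup faster, I converted the list of names to a set.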
from lxml import etree
with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()
print("read xml data")
# A set gives constant-time membership tests for the names to delete.
with open('del_names.txt', 'r') as file:
    names_to_delete = set(file.read().split("\n"))
print("read text data")
# Parse the document once instead of once per name.
tree = etree.XML(xml_data.encode())
# Single pass over all segments; remove those whose name is listed.
for segment in tree.xpath("*//segment"):
    name = segment.attrib.get("name")
    if name in names_to_delete:
        print(f"will delete segment '{name}'")
        segment.getparent().remove(segment)
print(" result ".center(80, "="))
# encoding="unicode" already returns a str, so no str() wrapper is needed.
new_xml = etree.tostring(tree, encoding="unicode", pretty_print=True)
print(new_xml)
Output:
read xml data
read text data
will delete segment '1'
will delete segment '3'
==================================== result ====================================
<?xml version='1.0' encoding='ASCII'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
</recording>
</corpus>
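The title also asks about writing the result to a new file, which the code above does not do (it only prints). The following is a minimal sketch of that step, not the method from the answer: it swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages. The helper names load_names and remove_segments, the file name filtered.xml, and the temporary output directory are all hypothetical, chosen only for illustration. Because ElementTree elements have no getparent(), the sketch walks over each possible parent and checks its direct segment children.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE_XML = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2"><orth>some text 1</orth></segment>
    <segment name="2" start="2" end="4"><orth>some text 2</orth></segment>
    <segment name="3" start="4" end="6"><orth>some text 3</orth></segment>
  </recording>
</corpus>"""


def load_names(text):
    """Return the non-empty lines of text as a set, stripped of whitespace."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_segments(root, names_to_delete):
    """Remove every <segment> whose name is in names_to_delete.

    Returns the number of segments removed.
    """
    removed = 0
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        # findall() returns a list, so removing children while looping is safe.
        for segment in parent.findall("segment"):
            if segment.get("name") in names_to_delete:
                parent.remove(segment)
                removed += 1
    return removed


# Stand-in for the contents of del_names.txt (assumed: one name per line).
names = load_names("1\n3\n")
root = ET.fromstring(SAMPLE_XML.encode("utf-8"))
removed_count = remove_segments(root, names)

# Write the filtered tree to a new file (hypothetical name and location).
out_path = os.path.join(tempfile.mkdtemp(), "filtered.xml")
ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# Read the file back to confirm which segments are left.
remaining = [seg.get("name") for seg in ET.parse(out_path).getroot().iter("segment")]
print(removed_count, remaining)
```

Filtering blank lines in load_names also avoids the empty-string entry that split("\n") leaves behind when the names file ends with a newline. With lxml, the same write step should be possible through the tree's own write method, but that variant is not shown or tested here.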
You can also use BeautifulSoup:
from bs4 import BeautifulSoup
my_string = """ <?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus> """
soup = BeautifulSoup(my_string, 'html.parser')
ids = [1,3] #IDs to delete
for id in ids:
    elements = soup.find_all("segment", attrs = {"name": str(id)})
    for element in elements:
        element.decompose()
print(soup.prettify())