Extract data from XML using ElementTree in Python
I have the following XML file, which I need to parse and extract data from into a CSV file. The file describes two boxes (box_id), each packed into a different parent (parent_box_id), along with details of each box's contents (element sgtin -> info_sgtin).
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<doc>
<info id_reference="2">
<data_down>
<tree>
<box_id>046071598600870568</box_id>
<parent_box_id>046071598600875594</parent_box_id>
</tree>
<tree>
<box_id>046071598600870575</box_id>
<parent_box_id>046071598600875595</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>04607008133585B0SE1HVHBGR3A</sgtin>
<box_id>046071598600870568</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870568</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>046070081335856F7P78HBVBEH2</sgtin>
<box_id>046071598600870568</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870568</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>046070081335854T61H7CSXDE9W</sgtin>
<box_id>046071598600870575</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870575</parent_box_id>
</tree>
</data_down>
</info>
</doc>
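For this I decided to use ElementTree in Python, but the problem is that my XML file contains two variants of the tags.
First I iterate over all the details and capture the box_id value, but after that I have to go to the parent entry and get the parent_box_id in which that box_id is packed.
In other words, I want to get the data in the following form: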
parent_box_id box_id sgtin series_number
046071598600875594 046071598600870568 04607008133585B0SE1HVHBGR3A 026A
046071598600875594 046071598600870568 046070081335856F7P78HBVBEH2 026A
046071598600875595 046071598600870575 046070081335854T61H7CSXDE9W 026A
But I don't know how to get the parent_box_id value. Thanks for any support from the community.
Here is my code:
import csv
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# Opening in 'w' mode truncates any previous result.csv, so no separate
# csv.writer(open(...)) call is needed beforehand.
with open('result.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Each <info_sgtin> describes one item and the box it is packed in.
    for alist in root.iter('info_sgtin'):
        sgtin = alist.find('sgtin').text
        box_id = alist.find('box_id').text
        series = alist.find('series_number').text
        writer.writerow([sgtin, box_id, series])
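Here is a solution using XPath (it first collects the mapping between box_id and parent_box_id from the direct children of tree). Is that what you are looking for? I'm not sure, because in your desired output 046071598600875595 is listed as the parent_box_id for box_id 046071598600870575, and I don't know where that comes from.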
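from lxml import etree

# fp is the XML file (path or file object) and parser is an lxml parser,
# e.g. etree.XMLParser(); neither is defined in this snippet.
root = etree.parse(fp, parser)
# Map each box_id that is a direct child of <tree> to its sibling parent_box_id.
parent_ids = {elem.text: elem.xpath("following-sibling::parent_box_id")[0].text
              for elem in root.xpath("//*/tree/box_id")}
for alist in root.iter('info_sgtin'):
    sgtin = alist.find('sgtin').text
    box_id = alist.find('box_id').text
    series = alist.find('series_number').text
    print(sgtin, parent_ids[box_id], box_id, series)
Output: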
04607008133585B0SE1HVHBGR3A 046071598600875594 046071598600870568 026A
046070081335856F7P78HBVBEH2 046071598600875594 046071598600870568 026A
046070081335854T61H7CSXDE9W 046071598600875594 046071598600870575 026A
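If your files are large and it makes sense to traverse them only once, you can use etree.iterparse with tag=["box_id"] or tag=["tree"]. In the former case, check which of the siblings you expect in either case you actually observe (sgtin, gtin, series_number, or parent_box_id). If you find parent_box_id, add a new mapping to your lookup table (a dictionary linking box_id to parent_box_id). If you find sgtin and the others, write out the data you collected from the siblings and get the parent_box_id from your lookup table.
Of course, this only works if the structure is such that the box_id to parent_box_id mappings always come before the sgtin, box_id, gtin, and series_number entries.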
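The single-pass idea above could be sketched roughly as follows. This is a minimal sketch rather than the code behind this answer: it uses the standard library's xml.etree.ElementTree.iterparse and filters on the tree tag by hand instead of relying on lxml's tag= argument, and it assumes that every box_id to parent_box_id mapping appears before the items that refer to it. The function name extract_rows and the shortened SAMPLE document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML: two box mappings, then two items.
SAMPLE = b"""<doc><info id_reference="2"><data_down>
<tree><box_id>046071598600870568</box_id><parent_box_id>046071598600875594</parent_box_id></tree>
<tree><box_id>046071598600870575</box_id><parent_box_id>046071598600875595</parent_box_id></tree>
<tree><sgtin><info_sgtin><sgtin>04607008133585B0SE1HVHBGR3A</sgtin>
<box_id>046071598600870568</box_id><gtin>04607008133585</gtin>
<series_number>026A</series_number></info_sgtin></sgtin>
<parent_box_id>046071598600870568</parent_box_id></tree>
<tree><sgtin><info_sgtin><sgtin>046070081335854T61H7CSXDE9W</sgtin>
<box_id>046071598600870575</box_id><gtin>04607008133585</gtin>
<series_number>026A</series_number></info_sgtin></sgtin>
<parent_box_id>046071598600870575</parent_box_id></tree>
</data_down></info></doc>"""


def extract_rows(source):
    """Yield [parent_box_id, box_id, sgtin, series_number] in one pass."""
    parent_ids = {}  # lookup table: box_id -> parent_box_id
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tree":
            continue  # only act once a whole <tree> has been read
        info = elem.find("sgtin/info_sgtin")
        if info is None:
            # Mapping variant: <tree> with direct box_id / parent_box_id children.
            box_id = elem.findtext("box_id")
            if box_id is not None:
                parent_ids[box_id] = elem.findtext("parent_box_id")
        else:
            # Item variant: look up the parent of the box the item is packed in.
            box_id = info.findtext("box_id")
            yield [parent_ids.get(box_id), box_id,
                   info.findtext("sgtin"), info.findtext("series_number")]
        elem.clear()  # drop the processed subtree to keep memory use low


rows = list(extract_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

If a mapping could appear after the items that use it, parent_ids.get would return None for those rows, and a two-pass approach would be the safer choice.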
You need to iterate over each <tree> tag and check whether it contains the data you need. If it does, collect it.
import xml.etree.ElementTree

root = xml.etree.ElementTree.parse('data.xml')

# collect parent data: box_id -> parent_box_id
parent_data = {}
for item in root.iter('tree'):
    box_id_match = item.find('box_id')
    parent_box_id_match = item.find('parent_box_id')
    if box_id_match is not None and parent_box_id_match is not None:
        parent_data[box_id_match.text] = parent_box_id_match.text

data = []
for item in root.iter('tree'):
    sgtin = item.find('sgtin/info_sgtin/sgtin')
    box_id = item.find('sgtin/info_sgtin/box_id')
    series_number = item.find('sgtin/info_sgtin/series_number')
    # collect valid data
    if sgtin is not None and box_id is not None and series_number is not None:
        parent_box_id = parent_data.get(box_id.text)
        data.append([parent_box_id, box_id.text, sgtin.text, series_number.text])

for row in data:
    print(row)
Output:
['046071598600875594', '046071598600870568', '04607008133585B0SE1HVHBGR3A', '026A']
['046071598600875594', '046071598600870568', '046070081335856F7P78HBVBEH2', '026A']
['046071598600875595', '046071598600870575', '046070081335854T61H7CSXDE9W', '026A']
Try this.
from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<doc>
<info id_reference="2">
<data_down>
<tree>
<box_id>046071598600870568</box_id>
<parent_box_id>046071598600875594</parent_box_id>
</tree>
<tree>
<box_id>046071598600870575</box_id>
<parent_box_id>046071598600875594</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>04607008133585B0SE1HVHBGR3A</sgtin>
<box_id>046071598600870568</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870568</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>046070081335856F7P78HBVBEH2</sgtin>
<box_id>046071598600870568</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870568</parent_box_id>
</tree>
<tree>
<sgtin>
<info_sgtin>
<sgtin>046070081335854T61H7CSXDE9W</sgtin>
<box_id>046071598600870575</box_id>
<gtin>04607008133585</gtin>
<series_number>026A</series_number>
</info_sgtin>
</sgtin>
<parent_box_id>046071598600870575</parent_box_id>
</tree>
</data_down>
</info>
</doc>
'''
doc = SimplifiedDoc(html)
# Select the <tree> elements that do not contain <sgtin>; these carry the
# box_id / parent_box_id pairs.
boxIds = doc.selects('data_down>tree').notContains('<sgtin>')
dic = {}
for box in boxIds:
    dic[box.box_id.html] = box.parent_box_id.html
datas = []
boxs = doc.selects('data_down>info_sgtin')
for box in boxs:
    datas.append([dic[box.box_id.html], box.box_id.html, box.sgtin.html, box.series_number.html])
print(datas)
Result:
[['046071598600875594', '046071598600870568', '04607008133585B0SE1HVHBGR3A', '026A'], ['046071598600875594', '046071598600870568', '046070081335856F7P78HBVBEH2', '026A'], ['046071598600875594', '046071598600870575', '046070081335854T61H7CSXDE9W', '026A']]