Loop through XML in Python
My dataset looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false"
name = "some_name">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false"
name = "some_name">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
<currentowner col_four="00001bvalue"
col_five="00001bvalue"
col_six="00001bfalse"
name = "some_name">
<addr col_seven="00001bvalue"
col_eight="00001bvalue"
col_nine="00001bfalse"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value"
name = "some_name">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false"
name = "some_name">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>
Currently I have two loops: one iterates over the child data, the other over the grandchild data.
import pandas
import xml.etree.ElementTree as element_tree

tree = element_tree.parse('<HERE_GOES_XML>')
root = tree.getroot()
name_space = {'ns0': 'http://SOMELINK'}

# root
date_from = root.attrib['date']
print(date_from)

# child
for pharma in root.findall('.//ns0:dept', name_space):
    for key, value in pharma.items():
        print(key + ': ' + value)

# grandchild; this must be merged into the loop above so the script
# walks through an entire dept node before moving to the next one
for owner in root.findall('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    owner_dict = {}
    for key, value in owner.items():
        print(key + ': ' + value)
The current result is:
2021-01-15
dept_id: 00001
col_two: 00001value
col_three: 00001false
dept_id: 00002
col_two: 00002value
col_three: 00002value
col_four: 00001value
col_five: 00001value
col_six: 00001false
col_four: 00002value
col_five: 00002value
col_six: 00002false
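My goal is a nested traversal that first iterates over an entire dept child together with its grandchildren, and only then moves on to the next dept. The expected result is shown below; later it will be converted to a pandas DataFrame (I will try to handle that next). Some columns have the same name in the child and the grandchild, so either a prefix is needed or only specific children should be looped over.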
dept.dept_id: 00001
dept.col_two: 00001value
dept.col_three: 00001false
dept.name: some_name
currentowner.col_four: 00001value
currentowner.col_five: 00001value
currentowner.col_six: 00001false
currentowner.name: some_name
currentowner.col_four: 00001bvalue
currentowner.col_five: 00001bvalue
currentowner.col_six: 00001bfalse
currentowner.name: some_name
addr.col_seven: 00001value
addr.col_eight: 00001value
addr.col_nine: 00001false
dept.dept_id: 00002
dept.col_two: 00002value
dept.col_three: 00002value
dept.name: some_name
currentowner.col_four: 00002value
currentowner.col_five: 00002value
currentowner.col_six: 00002false
currentowner.name: some_name
addr.col_seven: 00002value
addr.col_eight: 00002value
addr.col_nine: 00002false
[Update] - I came across zip, which should do the trick.
dept_list = []
for item in root.iterfind('.//ns0:dept', name_space):
    #print(item.attrib)
    dept_list.append(item.attrib)
#print(dept_list)

owner_list = []
for item in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    #print(item.attrib)
    owner_list.append(item.attrib)
#print(owner_list)

zipped = zip(dept_list, owner_list)
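One caveat with the zip idea: zip pairs items purely by position and stops at the shorter list, so it only lines up when every dept has exactly one currentowner. The sketch below uses a trimmed inline copy of the sample XML (shortened here for self-containment, which is an assumption and not the full dataset) to show the misalignment next to a nested-loop alternative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample data: dept 00001 has two owners, dept 00002 has one.
XML = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value"/>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00002">
    <owners>
      <currentowner col_four="00002value"/>
    </owners>
  </dept>
</depts>"""

ns = {'ns0': 'http://SOMELINK'}
root = ET.fromstring(XML)

dept_list = [d.attrib for d in root.iterfind('.//ns0:dept', ns)]
owner_list = [o.attrib for o in
              root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', ns)]

# zip pairs by position, so dept 00002 is matched with an owner of dept 00001.
zipped = [(d['dept_id'], o['col_four']) for d, o in zip(dept_list, owner_list)]

# Nested loops search for owners relative to each dept, keeping them attached.
nested = [
    (d.get('dept_id'), o.get('col_four'))
    for d in root.iterfind('ns0:dept', ns)
    for o in d.iterfind('ns0:owners/ns0:currentowner', ns)
]

print(zipped)
print(nested)
```

With this data the zipped list has two pairs and the second one is wrong, while the nested version yields three correctly attributed pairs.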
You can perform a depth-first search:
from xml.etree import ElementTree

root = ElementTree.parse('data.xml').getroot()
ns = {'ns0': 'http://SOMELINK'}

date_from = root.get('date')
print(f'{date_from=}')

for dept in root.findall('./ns0:dept', ns):
    # attributes of the dept itself
    for key, value in dept.items():
        print(f'{key}: {value}')
    # attributes of every descendant of this dept, in document order
    for node in dept.findall('.//*'):
        for key, value in node.items():
            print(f'{key}: {value}')
    print()
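Since the question asks for prefixed keys such as dept.dept_id, the same depth-first walk can strip the namespace from each tag and use the local name as a prefix. This is only a sketch: local_name and prefixed_lines are hypothetical helper names, and the inline XML is a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data (one dept, one owner, one addr).
XML = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

ns = {'ns0': 'http://SOMELINK'}


def local_name(element):
    """Return the tag without its '{namespace}' prefix."""
    return element.tag.rsplit('}', 1)[-1]


def prefixed_lines(dept):
    """List 'tag.attribute: value' lines for a dept and all its descendants."""
    lines = []
    # Element.iter() yields the element itself first, then its
    # descendants in document (depth-first) order.
    for node in dept.iter():
        for key, value in node.items():
            lines.append(f'{local_name(node)}.{key}: {value}')
    return lines


root = ET.fromstring(XML)
for dept in root.findall('ns0:dept', ns):
    print('\n'.join(prefixed_lines(dept)))
    print()
```

Note that document order puts each addr right after its own currentowner, which differs slightly from the ordering in the question's expected output, where the addr lines follow all currentowner lines.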
The looping can be done in a list comprehension that builds a dict by navigating the DOM. The following code goes straight into a DataFrame.
xml = """<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>"""
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
root.attrib
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([
    {**d.attrib,
     **d.find("ns0:owners/ns0:currentowner", ns).attrib,
     **d.find("ns0:owners/ns0:currentowner/ns0:addr", ns).attrib}
    for d in root.findall("ns0:dept", ns)
])
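Safer version
If any dept has no currentowner or currentowner/addr, the version above fails: find() returns None, so the .attrib access raises an AttributeError. Treat these elements as optional and walk the DOM instead. The dict keys are also changed to names built from the element tag and the attribute name. How you structure the comprehension depends on the design of your data: you need to consider 1-to-1, 1-to-optional, and 1-to-many relationships. This really goes back to the paper Codd wrote in 1970.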
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([
    {**{f"{d.tag.split('}')[1]}.{k}": v for k, v in d.items()},
     **{f"{co.tag.split('}')[1]}.{k}": v for k, v in co.items()},
     **{f"{addr.tag.split('}')[1]}.{k}": v
        for addr in co.findall("ns0:addr", ns) for k, v in addr.items()}}
    for d in root.findall("ns0:dept", ns)
    for co in d.findall("ns0:owners/ns0:currentowner", ns)
])
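The comprehension above produces one row per (dept, currentowner) pair, so a dept with no currentowner yields no row at all. If such depts should still appear, the 1-to-optional case can be spelled out with explicit loops. The following is a rough sketch under assumptions: prefixed and dept_rows are hypothetical helper names, and dept 00003 is an invented example that is not in the original data.

```python
import xml.etree.ElementTree as ET

# Invented test data: one owner with addr, one owner without addr,
# and a dept (00003) that has no owners at all.
XML = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value">
        <addr col_seven="00001value"/>
      </currentowner>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00003"/>
</depts>"""

ns = {'ns0': 'http://SOMELINK'}


def prefixed(element):
    """Return the element's attributes keyed as 'tag.attribute'."""
    tag = element.tag.rsplit('}', 1)[-1]
    return {f'{tag}.{key}': value for key, value in element.items()}


def dept_rows(root):
    """Build one dict per (dept, currentowner); ownerless depts keep a row."""
    rows = []
    for dept in root.findall('ns0:dept', ns):
        owners = dept.findall('ns0:owners/ns0:currentowner', ns)
        if not owners:
            rows.append(prefixed(dept))
            continue
        for owner in owners:
            row = {**prefixed(dept), **prefixed(owner)}
            addr = owner.find('ns0:addr', ns)
            if addr is not None:  # addr is treated as optional
                row.update(prefixed(addr))
            rows.append(row)
    return rows


rows = dept_rows(ET.fromstring(XML))
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should give the same column layout as the comprehension, with missing keys showing up as NaN.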