How to extract information from multiple XML nodes and hierarchies using Python?
I have the following structure:
<population>
<person id="101">
<attributes>
<attribute name="age" class="java.lang.Integer" >53</attribute>
</attributes>
<plan score="-0.38" selected="yes">
<activity type="outside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
</activity>
<leg mode="car" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="car" dep_time="17:15:22" trav_time="00:07:05">
<route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
</leg>
<activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
</activity>
</plan>
<plan score="-0.38" selected="no">
<activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="shopping" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
<route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
</leg>
<activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
</activity>
</plan>
</person>
<person id="102">
<attributes>
<attribute name="age" class="java.lang.Integer" >53</attribute>
</attributes>
<plan score="-0.38" selected="yes">
<activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
<route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
</leg>
<activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
</activity>
</plan>
</person>
<person id="103">
<attributes>
<attribute name="age" class="java.lang.Integer" >53</attribute>
</attributes>
<plan score="-0.38" selected="yes">
<activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="shopping" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
<route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
</leg>
<activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
<attributes>
<attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
</attributes>
</activity>
<leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
<route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
</leg>
<activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
</activity>
</plan>
</person>
</population>
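What I want is to extract the value of person id whenever plan selected="yes". I also want to extract all activity type and leg mode values. They should be stored in their existing order, for example in a dictionary (or a data frame, it doesn't really matter).
So the ideal result would look like this: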
id leg_activity
101 outside; car; work; car; outside
102 inside; bike; work; bike; work ...
...
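So far I have only used JMESPath, which I know is not the most suitable tool, so I would be happy to see other approaches, e.g. with ElementTree :) Also, I could not find a way to extract the activity and leg information in one step. This is my approach so far: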
import gzip
import xmltodict
import pandas as pd
import jmespath
box = xmltodict.parse(gzip.open(gzipfile, 'r'))
expression = jmespath.compile('population.person[].plan[?"@selected"==`yes`].activity[*].["@type"]')
coords = expression.search(box)
coords = pd.DataFrame.from_dict(coords)
Assuming your XML is in test.xml, the following should work:
from bs4 import BeautifulSoup
import pandas as pd

with open('test.xml') as f:
    soup = BeautifulSoup(f, features='lxml')

plan_log = []
for person in soup.find_all('person'):
    log = {'id': person.get('id')}
    activities = []
    # Only look at the plan(s) marked selected="yes"
    for plan in person.find_all('plan', attrs={'selected': 'yes'}):
        # Walk the direct children so activities and legs keep their order
        for detail in plan.children:
            if detail.name == 'activity':
                activities.append(detail.get('type'))
            elif detail.name == 'leg':
                activities.append(detail.get('mode'))
            # activities.append(detail.get('type') or detail.get('mode'))
    log['leg_activity'] = ', '.join(activities)
    plan_log.append(log)

df = pd.DataFrame(plan_log)
print(df)
Output:
id leg_activity
0 101 outside, car, work, car, outside
1 102 inside, bike, work, bike, work, pt, outside
2 103 inside, bike, shopping, bike, work, pt, outside
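
Since the question mentions ElementTree, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. The SAMPLE string is a trimmed-down stand-in for the real file, and the function name selected_plan_sequences is made up for illustration. It assumes that plan is a direct child of person, and that activity and leg are direct children of plan, as in the XML above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the question's XML (illustrative only).
SAMPLE = """
<population>
  <person id="101">
    <plan score="-0.38" selected="yes">
      <activity type="outside" link="81312"/>
      <leg mode="car" dep_time="08:22:00"/>
      <activity type="work" link="138852">
        <attributes>
          <attribute name="innerParis" class="java.lang.Boolean">true</attribute>
        </attributes>
      </activity>
      <leg mode="car" dep_time="17:15:22"/>
      <activity type="outside" link="189898"/>
    </plan>
    <plan score="-0.38" selected="no">
      <activity type="inside" link="81312"/>
      <leg mode="bike" dep_time="08:22:00"/>
    </plan>
  </person>
  <person id="102">
    <plan score="-0.38" selected="yes">
      <activity type="inside" link="81312"/>
      <leg mode="bike" dep_time="08:22:00"/>
      <activity type="work" link="138852"/>
    </plan>
  </person>
</population>
"""


def selected_plan_sequences(root):
    """Map each person id to the ordered activity types / leg modes
    of that person's selected plan(s)."""
    result = {}
    for person in root.iter('person'):
        steps = []
        # ElementTree's limited XPath supports [@attr='value'] predicates.
        for plan in person.findall("plan[@selected='yes']"):
            # Iterating an Element yields its direct children in document order,
            # so nested <attributes> elements are never visited.
            for child in plan:
                if child.tag == 'activity':
                    steps.append(child.get('type'))
                elif child.tag == 'leg':
                    steps.append(child.get('mode'))
        # Persons without a selected plan are skipped.
        if steps:
            result[person.get('id')] = '; '.join(steps)
    return result


root = ET.fromstring(SAMPLE)
sequences = selected_plan_sequences(root)
print(sequences)
```

For the gzipped file from the question, something like root = ET.parse(gzip.open(gzipfile)).getroot() should give the same kind of root element, since ET.parse accepts a file object. If a data frame is preferred, pd.DataFrame(list(sequences.items()), columns=['id', 'leg_activity']) turns the dictionary into one.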