How to extract child values depending on whether the parent value appears in a list, using Python?
I have an XML structure like this:
<population desc="Switzerland Baseline">
<attributes>
<attribute name="coordinateReferenceSystem" class="java.lang.String" >Atlantis</attribute>
</attributes>
<person id="1015600">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="10002042">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="1241567">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="1218895">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="10002042">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
</population>
I have a pandas DataFrame named agents that holds the relevant ids:
id
0 1015600
1 1218895
2 1241567
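What I want is to iterate over the large XML and extract the value of ptSubscription for every person whose id is in agents.
The desired output is a DataFrame or a list with the id and the value: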
id ptSubscription
0 1015600 false
1 1218895 true
2 1241567 true
My approach returns an empty output:
import gzip
import xml.etree.cElementTree as ET
import pandas as pd
from collections import defaultdict

file = 'output_plans.xml.gz'
data = gzip.open(file, 'r')
root = ET.parse(data).getroot()

rows = []
for it in root.iter('person'):
    if it.attrib['id'] in agents[["id"]]:
        id = it.attrib['id']
        age = it.find('attributes/attribute[@name="ptSubscription"]').text
        rows.append([id, age])
    #root.clear()

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
pt
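One likely reason the loop above collects nothing: `in` applied to a DataFrame tests the column labels, not the cell values, and the XML attribute values are strings while the id column (as displayed) appears to hold integers. The sketch below illustrates both points on a small inline sample; the inline XML and the integer dtype of `agents["id"]` are assumptions made for illustration, not taken from the real file.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Small stand-in for the real file (assumption: same layout as the sample above).
XML = """<population>
  <person id="1015600"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
  </attributes></person>
  <person id="10002042"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
  </attributes></person>
  <person id="1218895"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
  </attributes></person>
</population>"""

# Assumption: the id column holds integers, as the printed frame suggests.
agents = pd.DataFrame({"id": [1015600, 1218895, 1241567]})
root = ET.fromstring(XML)

# `in` on a DataFrame checks column labels, so an id is never found.
in_frame = "1015600" in agents[["id"]]
label_hit = "id" in agents[["id"]]

# Compare like with like: attribute values from the XML are strings.
wanted = set(agents["id"].astype(str))

rows = []
for person in root.iter("person"):
    pid = person.attrib["id"]
    if pid in wanted:
        node = person.find('attributes/attribute[@name="ptSubscription"]')
        rows.append([pid, node.text])

pt = pd.DataFrame(rows, columns=["id", "ptSubscription"])
```

Building a set of strings once keeps each membership test cheap, which matters when the file contains many person elements.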
A general approach that extracts the requested information with lxml:
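import pandas as pd
from lxml import etree

# Parse the whole document once; binary mode leaves decoding to lxml.
with open("sample.xml", "rb") as fd:
    tree = etree.parse(fd)

# Absolute XPath; "{}" is filled in with each person id.
xpath_fmt = '/population/person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'
agents = [1015600, 1218895, 1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = tree.xpath(xpath)
    # An id that occurs more than once yields one row per matching person.
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])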
With the standard library, the code would look similar:
import pandas as pd
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

with open("sample.xml", "rb") as fd:
    root = ET.parse(fd).getroot()

# Relative path: root already is the <population> element.
xpath_fmt = 'person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'
agents = [1015600,
          1218895,
          1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = root.findall(xpath)
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])
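The path has no leading /population here because the XPath must be relative to the population element, which is what getroot() returns.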
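The question mentions a large, gzipped file, whereas the answers above load the whole tree and run one query per id. A single streaming pass may scale better in that case. The following is a minimal sketch using only the standard library; the helper name `extract_subscriptions` and the inline sample are hypothetical, and the real file would be opened with `gzip.open('output_plans.xml.gz', 'rb')` instead of the in-memory buffer.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def extract_subscriptions(source, wanted_ids):
    """Stream <person> elements and keep ptSubscription for the wanted ids."""
    wanted = {str(i) for i in wanted_ids}  # XML attribute values are strings
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "person":
            continue
        pid = elem.get("id")
        if pid in wanted:
            node = elem.find('attributes/attribute[@name="ptSubscription"]')
            if node is not None:
                rows.append((pid, node.text))
        # Drop the person's children and attributes once it has been handled;
        # the emptied element itself stays attached to the root.
        elem.clear()
    return rows


# Hypothetical in-memory stand-in for the gzipped file.
sample = b"""<population>
  <person id="1015600"><attributes>
    <attribute name="ptSubscription">false</attribute>
  </attributes></person>
  <person id="10002042"><attributes>
    <attribute name="ptSubscription">false</attribute>
  </attributes></person>
  <person id="1241567"><attributes>
    <attribute name="ptSubscription">true</attribute>
  </attributes></person>
  <person id="1218895"><attributes>
    <attribute name="ptSubscription">true</attribute>
  </attributes></person>
</population>"""

with gzip.open(io.BytesIO(gzip.compress(sample)), "rb") as fh:
    rows = extract_subscriptions(fh, [1015600, 1218895, 1241567])
```

The rows come back in document order, not in the order of the id list, so a later sort or merge may be needed if the order of agents matters.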
We can use parsel to extract the details:
import pandas as pd
from parsel import Selector

# read in the data
with open("test.xml") as fd:
    tree = fd.read()

# parse the xml
selector = Selector(text=tree, type='xml')

# ids to look for
agents = ['1015600', '1218895', '1241567']

# select the persons whose id is in agents; padding both sides with spaces
# makes contains() match whole ids only, not substrings of a longer id
id_list = ' ' + ' '.join(agents) + ' '
ids = selector.xpath(f".//person[contains({id_list!r}, concat(' ', @id, ' '))]")

# pair each id with the attribute whose name is ptSubscription
d = {}
for ent in ids:
    vals = ent.xpath(".//attribute[@name='ptSubscription']/text()").get()
    key = ent.xpath(".//@id").get()
    d[key] = vals

print(d)
# {'1015600': 'false', '1241567': 'true', '1218895': 'true'}

# put into a dataframe
pd.DataFrame.from_dict(d, orient='index', columns=['PTSubscription'])
Alternative: use Python's built-in ElementTree together with elementpath:
import xml.etree.ElementTree as ET
import elementpath

root = ET.parse("test.xml").getroot()
agents = ('1015600', '1218895', '1241567')

# The tuple's repr doubles as an XPath 2.0 sequence: @id=('a', 'b', 'c')
id_path = f".//person[@id={agents}]"
subscription_path = ".//attribute[@name='ptSubscription']/text()"

d = {}
for entry in elementpath.select(root, id_path):
    key = elementpath.select(entry, "./@id")[0]
    val = elementpath.select(entry, subscription_path)[0]
    d[key] = val
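The dictionary `d` maps each id to its value but does not yet have the shape asked for in the question. One possible way to get there, sketched under the assumption that agents is a DataFrame with an integer id column and that `d` holds string keys as produced above (the literal dictionary below is only a stand-in for that result):

```python
import pandas as pd

# Assumption: agents as shown in the question, with integer ids.
agents = pd.DataFrame({"id": [1015600, 1218895, 1241567]})

# Stand-in for the id -> value mapping collected from the XML.
d = {"1015600": "false", "1241567": "true", "1218895": "true"}

result = agents.copy()
# Look each id up by its string form; ids missing from d would become NaN.
result["ptSubscription"] = result["id"].astype(str).map(d)
```

Mapping onto the existing frame keeps the row order of agents, which matches the layout of the desired output.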