How to extract child values depending on the parent value's appearance in a list, using Python?

Question

I have an XML structure like this:

<population desc="Switzerland Baseline">
    <attributes>
        <attribute name="coordinateReferenceSystem" class="java.lang.String" >Atlantis</attribute>
    </attributes>

    <person id="1015600">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
    <person id="10002042">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
    <person id="1241567">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>   
    <person id="1218895">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>   
    <person id="10002042">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
</population>

I have a pandas DataFrame called agents that holds the relevant ids:

    id
0   1015600
1   1218895
2   1241567

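What I want is to iterate over the large XML and extract the value of ptSubscription for every person whose id is in agents.

The desired output is a DataFrame or a list with the id and the value: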

    id          ptSubscription
0   1015600     false
1   1218895     true
2   1241567     true

My approach returns an empty output:

import gzip
import xml.etree.cElementTree as ET
import pandas as pd
from collections import defaultdict

file = 'output_plans.xml.gz'
data = gzip.open(file, 'r')
root = ET.parse(data).getroot()

rows = []
for it in root.iter('person'):
    if it.attrib['id'] in agents[["id"]]:
        id = it.attrib['id']
        age = it.find('attributes/attribute[@name="ptSubscription"]').text
        rows.append([id, age])
#root.clear()

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
pt

Answer 1

A general approach that extracts the requested information with lxml:

from lxml import etree
import pandas as pd

with open("sample.xml") as fd:
    tree = etree.parse(fd)

xpath_fmt = '/population/person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'


agents = [1015600,1218895,1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = tree.xpath(xpath)
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])

With the standard library, the code is similar:

import xml.etree.ElementTree as ET
import pandas as pd

with open("sample.xml") as fd:
    root = ET.parse(fd).getroot()

xpath_fmt = 'person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'


agents = [1015600, 1218895, 1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = root.findall(xpath)
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])

The XPath differs here because it has to be relative to the population element, which is the root.
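
For a large gzipped file, a single pass over the document may be cheaper than running one query per id. The following is a minimal sketch rather than part of the original answer. It assumes the ids fit in memory as a set of strings; the `extract_subscriptions` helper and the inline `SAMPLE` data are made up for illustration. Converting the ids to strings matters because the XML `id` attribute is a string, and the `in agents[["id"]]` test in the question checks column labels rather than values.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a small inline sample standing in for the real file.
SAMPLE = """<population desc="Switzerland Baseline">
    <person id="1015600">
        <attributes>
            <attribute name="age" class="java.lang.Integer">86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="10002042">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="1241567">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
        </attributes>
    </person>
    <person id="1218895">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
        </attributes>
    </person>
</population>"""


def extract_subscriptions(source, wanted_ids):
    """Return [id, ptSubscription] pairs for persons whose id is wanted."""
    # The id attribute is text, so compare strings with strings.
    wanted = {str(i) for i in wanted_ids}
    rows = []
    # "end" events fire once an element and its children are fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "person":
            continue
        pid = elem.get("id")
        if pid in wanted:
            node = elem.find('attributes/attribute[@name="ptSubscription"]')
            if node is not None:
                rows.append([pid, node.text])
        # Drop the children of the processed person to limit memory use.
        elem.clear()
    return rows


rows = extract_subscriptions(
    io.BytesIO(SAMPLE.encode("utf-8")), [1015600, 1218895, 1241567]
)
print(rows)
```

For the real file, passing `gzip.open('output_plans.xml.gz', 'rb')` as `source` should work the same way, and the rows can then go into `pd.DataFrame(rows, columns=['id', 'PTSubscription'])`.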

Answer 2

We can use parsel to extract the details:

#read in data : 

with open("test.xml") as fd:
    tree = fd.read()

#import library and parse xml :
from parsel import Selector

selector = Selector(text=tree, type='xml')

#checklist : 
agents = ['1015600','1218895','1241567']

#track the ids
#this checks and selects ids in agents
ids = selector.xpath(f".//person[contains({' '.join(agents)!r},@id)]")

#pair ids with attribute where the name == ptSubscription : 

d = {}
for ent in ids:
    vals = ent.xpath(".//attribute[@name='ptSubscription']/text()").get()
    key = ent.xpath(".//@id").get()
    d[key] = vals

print(d)

{'1015600': 'false', '1241567': 'true', '1218895': 'true'}

#put into a dataframe : 
import pandas as pd
pd.DataFrame.from_dict(d,orient='index', columns=['PTSubscription'])
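
One caveat with the `contains(...)` predicate used above: it is a substring test against the joined id string, so a shorter id such as "156" would be selected as well. Below is a hedged sketch of an exact-match predicate, assuming the ids contain no quote characters; `exact_id_xpath` is a hypothetical helper name, not part of parsel.

```python
agents = ['1015600', '1218895', '1241567']

# The substring pitfall: "156" occurs inside "1015600".
print("156" in " ".join(agents))


def exact_id_xpath(ids):
    """Build an XPath 1.0 expression that matches person ids exactly."""
    # Assumption: ids are plain digit strings, so no quote escaping is needed.
    clauses = " or ".join(f'@id="{i}"' for i in ids)
    return f".//person[{clauses}]"


xpath = exact_id_xpath(agents)
print(xpath)
```

The resulting string can be passed to `selector.xpath(...)` in place of the `contains` expression.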

Alternative: use Python's built-in ElementTree with elementpath:

import xml.etree.ElementTree as ET
import elementpath
root = ET.parse("test.xml").getroot()

agents = ('1015600','1218895','1241567')

id_path = f".//person[@id={agents}]"
subscription_path = ".//attribute[@name='ptSubscription']/text()"

d = {}
for entry in elementpath.select(root,id_path):
    key = elementpath.select(entry,"./@id")[0]
    val = elementpath.select(entry,subscription_path)[0]
    d[key] = val
