How to extract XML attributes with Python when another attribute occurs in a given list?

I have a list of linkId values.

links_o_i = [652518, 345004, 225317, 177396, 551734]

I also have an XML file with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE facilities SYSTEM "http://www.matsim.org/files/dtd/facilities_v1.dtd">
<facilities name="Facilities from different sources">

<!-- ====================================================================== -->

    <facility id="10002" linkId="666355" x="2684102.0" y="1253168.0">
        <activity type="other">
        </activity>

        <activity type="work">
        </activity>

    </facility>

<!-- ====================================================================== -->

    <facility id="10007" linkId="961312" x="2683486.0" y="1247853.0">
        <activity type="other">
        </activity>

        <activity type="work">
        </activity>

    </facility>

<!-- ====================================================================== -->

    <facility id="100070" linkId="652518" x="2684238.0" y="1246568.0">
        <activity type="leisure">
        </activity>

        <activity type="other">
        </activity>

        <activity type="work">
        </activity>

    </facility>

<!-- ====================================================================== -->

    <facility id="100071" linkId="1063278" x="2689220.0" y="1243493.0">
        <activity type="leisure">
        </activity>

        <activity type="other">
        </activity>

        <activity type="work">
        </activity>

    </facility>

<!-- ====================================================================== -->

    <facility id="100072" linkId="786540" x="2680812.0" y="1249375.0">
        <activity type="leisure">
        </activity>

        <activity type="other">
        </activity>

        <activity type="work">
        </activity>

    </facility>

<!-- ====================================================================== -->

    <facility id="100073" linkId="225317" x="2681506.0" y="1249508.0">
        <activity type="other">
        </activity>

        <activity type="shop">
        </activity>

        <activity type="work">
        </activity>

    </facility>

</facilities>

I want to parse the XML file and extract the x and y values of every facility whose linkId is in the links_o_i list.

The goal is a three-column DataFrame holding the linkId, x and y values.

So far my approach returns nothing, and I am struggling to find out why. Note that both the list and the XML file are much larger in practice.

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd


tree = ET.iterparse(gzip.open("file.xml.gz", 'r'))
link_coords = defaultdict(list)
for xml_event, elem in tree:
    attributes = elem.attrib
    if elem.tag == 'facility' \
    and elem.attrib["linkId"] in links_o_i:
        link_coords[attributes['linkId']].append[attributes['x', 'y']]
    elem.clear()  
link_coords = pd.DataFrame.from_dict(link_coords)

You can use xmltodict to parse the data into a dictionary and extract what you need:

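import pandas as pd
import xmltodict

# data: the XML document as a string or bytes
# keep only the linkId, x and y attributes of every facility
extract = [{k: v for k, v in ent.items() if k in ['@linkId', '@x', '@y']}
           for ent in xmltodict.parse(data)['facilities']['facility']]

# filter for only entries in the list
res = [ent for ent in extract if int(ent['@linkId']) in links_o_i]

# read into dataframe
pd.DataFrame(res)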

     @linkId    @x          @y
0   652518  2684238.0   1246568.0
1   225317  2681506.0   1249508.0
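
As for why the loop in the question returns nothing: XML attribute values are always strings, so elem.attrib["linkId"] in links_o_i compares a string such as "652518" against integers and never matches. Once it did match, .append[...] (square brackets instead of a call) and attributes['x', 'y'] (a tuple used as a single key) would each raise an error. If the file is too large to load in one piece, a streaming version along the following lines may work. It is only a sketch: the function name facility_rows and the returned tuple layout are my own choices and not part of the question.

```python
import xml.etree.ElementTree as ET


def facility_rows(source, link_ids):
    """Stream <facility> elements from `source` (a path or a binary file
    object) and return (linkId, x, y) tuples for the wanted link ids.

    Hypothetical helper for illustration; the name is not from the question.
    """
    # Attribute values are strings, so compare against string ids.
    # A set also keeps the membership test cheap for a long list.
    wanted = {str(i) for i in link_ids}
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "facility":
            continue
        link_id = elem.get("linkId")
        if link_id in wanted:
            rows.append((int(link_id), float(elem.get("x")), float(elem.get("y"))))
        elem.clear()  # drop the finished facility and its children
    return rows


# Usage with the gzipped file from the question (left as comments so the
# block runs without that file):
# import gzip
# with gzip.open("file.xml.gz", "rb") as fh:
#     rows = facility_rows(fh, links_o_i)
```

The result can then be turned into the three-column frame with pd.DataFrame(rows, columns=["linkId", "x", "y"]). Unlike the xmltodict version, the columns come out numeric and without the @ prefix.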