Python XML to CSV: look up values in a second dataframe and list multiple values

This is a follow-up to a related question. Thanks to @aneroid for helping me get started.

I was able to extract my columns and rows, but two issues remain, and I don't know what to search for to find the answers.

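Remaining issue 1: Some attributes in my XML have as many as 20 values. I want to list these in a single comma-separated string. See "Years" in the example below.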

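I found some material on grouping, but I don't think it will work here, because I am only getting the first value from the XML. I need to list all matching values in the XML. Also, my full report has 120 columns, so would I need to list every column as a group-by key?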

(UPDATE START) I looked into lxml.etree and can now get this:

['2019', '2020']

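I have updated the Python below. If anyone can help with the final step of getting from that list to 2019, 2020, that would be great. Issue 2 is similar: I just need to extract the values from a dictionary. (UPDATE END)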
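The step from a list of matches to one comma-separated string can be sketched with `str.join`. This is a minimal sketch, not the poster's code: the XML fragment is hypothetical, shaped only after the paths used in the question, and it uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages. With lxml, `row.xpath(".../name/text()")` already returns the list of strings directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element names follow the paths in the question,
# but the real file layout is an assumption.
SAMPLE = """
<row>
  <type>COLLECTION-ITEM</type>
  <name>name of Item 1</name>
  <RPTD-HIST-CODE>
    <result>
      <row><name>2019</name></row>
      <row><name>2020</name></row>
    </result>
  </RPTD-HIST-CODE>
</row>
"""

row = ET.fromstring(SAMPLE)

# Collect the text of every matching <name> element, skipping empty ones.
year_items = [
    el.text
    for el in row.findall("./RPTD-HIST-CODE/result/row/name")
    if el.text
]

# join() on an empty list returns "", so an item without years gives an empty cell.
years = ", ".join(year_items)
print(years)
```

Because the values are joined while each row is parsed, no grouping step is needed afterwards, regardless of how many columns the report has.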

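Remaining issue 2: The XML lists related things by name, and then lists them again as separate "rows" that carry the things' additional attributes. I need to include the value of one of those additional attributes in the report. In my sample Python I created a second dataframe, thing_df, which holds the name and ID attributes. I need to match the thing names in coll_df against the thing names in thing_df to get each thing_id and add it to coll_df.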

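I found some material on merging datasets, but those examples seem aimed at merging what I call collections in my example, which is not what I am looking for.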
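Since one cell may hold several thing names, a plain name-to-ID mapping may fit better here than a dataframe merge. The sketch below is one possible approach under stated assumptions: the thing names are invented for illustration, the IDs are taken from the desired output, and `ids_for` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical lookup data with the same columns as thing_df in the question.
thing_df = pd.DataFrame(
    {
        "Thing Name": ["thing A", "thing B", "thing C"],
        "Thing ID": ["thing_000745", "thing_000783", "thing_000803"],
    }
)

# Turn the two columns into a plain name -> ID dictionary.
name_to_id = dict(zip(thing_df["Thing Name"], thing_df["Thing ID"]))


def ids_for(names, missing="Missing thing ID"):
    """Return the IDs for a list of thing names as one comma-separated string."""
    return ", ".join(name_to_id.get(n, missing) for n in names)


print(ids_for(["thing A"]))
print(ids_for(["thing B", "thing C"]))
```

The same helper covers both the single-valued and the multi-valued column, and an empty list of names simply yields an empty string.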

Desired output:

,Collection item,ITEM-ID,ATTRIB-1,PERSON-TYPE-1-NAME,ATTRIB-2,PERSON-TYPE-2-NAME,RELATED-THING-1 id,RELATED-THING-2 IDs,Years
0,name of Item 1,item_000001,Yes,name of person 1,Yes,name of person 2,thing_000745,"thing_000783, thing_000803","2019, 2020"

Python (updated):

# -*- coding: utf-8 -*-

# Importing the required libraries
#import xml.etree.ElementTree as Xet
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME" ,
        "ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
        "RELATED-THING-2 IDs", "Years"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (
            row.findtext('type') == "THING-TYPE-1" or
            row.findtext('type') == "THING-TYPE-2"
            ):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    item_id = row.findtext("ITEM-ID", "Missing item ID")
    attrib_1 = row.findtext("ATTRIB-1", "Missing attribute 1")
    p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name")
    attrib_2 = row.findtext("ATTRIB-2", "Missing attribute 2")
    p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name")
    relat_thing1 = row.xpath("./RELATED-THING-1/result/row/name/text()")
    #relat_thing1_id = look up relat_thing1 in thing_df as "Thing Name" \
    #    and return "Thing ID"
    relat_thing2 = row.xpath("./RELATED-THING-2/result/row/name/text()")
    #relat_thing2_ids = look up every relat_thing2 in thing_df as "Thing Name" \
    #    and return all "Thing ID"
    years = row.xpath("./RPTD-HIST-CODE/result/row/name/text()")

    coll_rows.append({"Collection item": name,
                 "ITEM-ID": item_id,
                 "ATTRIB-1": attrib_1,
                 "PERSON-TYPE-1-NAME": p1_name,
                 "ATTRIB-2": attrib_2,
                 "PERSON-TYPE-2-NAME": p2_name,
                 #"RELATED-THING-1 id": relat_thing1_id,
                 #"RELATED-THING-2 IDs": relat_thing2_ids,
                 "Years": years
})

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')

I have solved everything except looking up the "Thing Name" in the dataframe to get the "Thing ID". I will post a new, simpler question that focuses on that.

Here is the new Python, which solves my remaining issue 1 and half of remaining issue 2:

# -*- coding: utf-8 -*-

# Importing the required libraries
#import xml.etree.ElementTree as Xet
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME" ,
        "ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
        "RELATED-THING-2 IDs", "Years"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
# Row types that belong in the thing lookup dataframe
THING_TYPES = ("THING-TYPE-1", "THING-TYPE-2")

for row in root:
    row_type = row.findtext('type')
    # Collect rows for the thing lookup dataframe
    if row_type in THING_TYPES:
        thing_rows.append({"Thing Name": row.findtext("name"),
                           "Thing ID": row.findtext("THING-ID")})
    # Find only collection items
    if row_type != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    item_id = row.findtext("ITEM-ID", "Missing item ID")
    attrib_1 = row.findtext("ATTRIB-1", "Missing attribute 1")
    p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name")
    attrib_2 = row.findtext("ATTRIB-2", "Missing attribute 2")
    p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name")
    # xpath() with text() returns a list of strings; ', '.join() turns it
    # into one comma-separated string and gives "" for an empty list
    relat_thing1 = ', '.join(
        row.xpath("./RELATED-THING-1/result/row/name/text()"))
    # Still to do: look up relat_thing1 in thing_df as "Thing Name"
    #    and return "Thing ID"
    relat_thing2 = ', '.join(
        row.xpath("./RELATED-THING-2/result/row/name/text()"))
    # Still to do: look up every relat_thing2 in thing_df as "Thing Name"
    #    and return all "Thing ID"
    years = ', '.join(
        row.xpath("./RPTD-HIST-CODE/result/row/name/text()"))

    # The two RELATED-THING columns hold names for now, not IDs
    coll_rows.append({"Collection item": name,
                 "ITEM-ID": item_id,
                 "ATTRIB-1": attrib_1,
                 "PERSON-TYPE-1-NAME": p1_name,
                 "ATTRIB-2": attrib_2,
                 "PERSON-TYPE-2-NAME": p2_name,
                 "RELATED-THING-1 id": relat_thing1,
                 "RELATED-THING-2 IDs": relat_thing2,
                 "Years": years
})

# Build both dataframes once, after the loop has collected all rows
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')
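For the remaining name-to-ID lookup, one possible approach is to read the XML in two passes: first collect every thing's name and ID into a dictionary, then build the collection rows and translate names to IDs. Doing the lookup inside a single loop could miss a thing whose row appears after the collection item that references it. The sketch below is only an illustration: the XML sample and the thing names are hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the thing rows come AFTER the item that refers to them.
SAMPLE = """
<result>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <RELATED-THING-2>
      <result>
        <row><name>thing B</name></row>
        <row><name>thing C</name></row>
      </result>
    </RELATED-THING-2>
  </row>
  <row><type>THING-TYPE-2</type><name>thing B</name><THING-ID>thing_000783</THING-ID></row>
  <row><type>THING-TYPE-2</type><name>thing C</name><THING-ID>thing_000803</THING-ID></row>
</result>
"""

root = ET.fromstring(SAMPLE)
THING_TYPES = {"THING-TYPE-1", "THING-TYPE-2"}

# Pass 1: build the complete name -> ID lookup before touching any collection item.
thing_ids = {
    row.findtext("name"): row.findtext("THING-ID")
    for row in root
    if row.findtext("type") in THING_TYPES
}

# Pass 2: build the collection rows, translating each related name to its ID.
coll_rows = []
for row in root:
    if row.findtext("type") != "COLLECTION-ITEM":
        continue
    names = [el.text for el in row.findall("./RELATED-THING-2/result/row/name")]
    coll_rows.append({
        "Collection item": row.findtext("name", "Missing name"),
        # Unknown names become "" here; a visible placeholder may be preferable.
        "RELATED-THING-2 IDs": ", ".join(thing_ids.get(n, "") for n in names),
    })

print(coll_rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(coll_rows, columns=coll_cols)` exactly as in the code above.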