Python XML to CSV: look up values in a second dataframe and list multiple values
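This is related to an earlier question. Thanks to @aneroid for helping me get started.
I was able to get my columns and rows out, but I have two remaining issues and I don't know what to search for to find the answers.
Remaining issue 1: Some attributes in my XML have as many as 20 values. I want to list these in a single comma-separated string. See "Years" in the example below.
I found some material on grouping, but I don't think it will work, because I am only getting the first value from the XML. I need to list all matching values in the XML. Also, my full report has 120 columns, so would I need to list every column as a group-by key?
(Update start) I looked into lxml.etree and can now get this: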
['2019', '2020']
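I updated the Python below. If anyone can help with the last step, getting from that list to 2019, 2020, that would be awesome. Issue 2 is similar: I just need to pull values out of a dictionary. (Update end)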
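A minimal sketch of that last step, assuming the xpath() call returns a plain list of strings as shown in the update. The helper name join_values is made up for illustration:

```python
def join_values(values, sep=", "):
    """Join a list of strings into one string; an empty list gives ''."""
    return sep.join(v.strip() for v in values)


# The list shown in the update above.
year_items = ['2019', '2020']
years = join_values(year_items)
print(years)  # 2019, 2020
```

Because str.join returns an empty string for an empty list, no separate check for missing values should be needed.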
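Remaining issue 2: The XML lists related things by name, and then lists them again as separate "rows" with additional attributes of the thing. I need to include the value of one of those additional attributes in the report. In my sample Python I create a second dataframe, thing_df, which holds the name and ID attributes. I need to match the thing name in coll_df against the thing name in thing_df to get its thing_id and add it to coll_df.
I found some material on merging datasets, but those examples seem to be about merging what I call collections in my example, which is not what I am looking for.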
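One possible way to do this lookup without a merge is to turn thing_df into a plain name-to-ID dictionary. This is only a sketch: the thing names are placeholders, the IDs are copied from the desired output, and lookup_ids and name_to_id are hypothetical names.

```python
import pandas as pd

# Hypothetical lookup data: placeholder names, IDs from the desired output.
thing_df = pd.DataFrame({
    "Thing Name": ["name of thing A", "name of thing B", "name of thing C"],
    "Thing ID": ["thing_000745", "thing_000783", "thing_000803"],
})

# Turn the two columns into a plain name -> ID mapping.
name_to_id = thing_df.set_index("Thing Name")["Thing ID"].to_dict()


def lookup_ids(names, missing="Missing thing ID"):
    """Return the IDs for the given names as one comma-separated string."""
    return ", ".join(name_to_id.get(name, missing) for name in names)


related_1 = lookup_ids(["name of thing A"])
related_2 = lookup_ids(["name of thing B", "name of thing C"])
print(related_1)
print(related_2)
```

If two things share the same name, the dictionary keeps only the last ID, so this assumes thing names are unique.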
Desired output:
,Collection item,ITEM-ID,ATTRIB-1,PERSON-TYPE-1-NAME,ATTRIB-2,PERSON-TYPE-2-NAME,RELATED-THING-1 id,RELATED-THING-2 IDs,Years
0,name of Item 1,item_000001,Yes,name of person 1,Yes,name of person 2,thing_000745,"thing_000783, thing_000803","2019, 2020"
Python (updated):
# -*- coding: utf-8 -*-
# Importing the required libraries
#import xml.etree.ElementTree as Xet
import lxml.etree as Xet
import pandas as pd

# Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME",
             "ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
             "RELATED-THING-2 IDs", "Years"]
coll_rows = []

# Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()

for row in root:
    # Create thing lookup dataframe
    if (
        row.findtext('type') == "THING-TYPE-1" or
        row.findtext('type') == "THING-TYPE-2"
    ):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)

    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue

    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    item_id = row.findtext("ITEM-ID", "Missing item ID")
    attrib_1 = row.findtext("ATTRIB-1", "Missing attribute 1")
    p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name")
    attrib_2 = row.findtext("ATTRIB-2", "Missing attribute 2")
    p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name")
    # xpath() with text() returns a list of every matching value
    relat_thing1 = row.xpath("./RELATED-THING-1/result/row/name/text()")
    #relat_thing1_id = look up relat_thing1 in thing_df as "Thing Name" \
    #    and return "Thing ID"
    relat_thing2 = row.xpath("./RELATED-THING-2/result/row/name/text()")
    #relat_thing2_id = look up every relat_thing2 in thing_df as "Thing Name" \
    #    and return all "Thing ID"
    years = row.xpath("./RPTD-HIST-CODE/result/row/name/text()")

    coll_rows.append({"Collection item": name,
                      "ITEM-ID": item_id,
                      "ATTRIB-1": attrib_1,
                      "PERSON-TYPE-1-NAME": p1_name,
                      "ATTRIB-2": attrib_2,
                      "PERSON-TYPE-2-NAME": p2_name,
                      #"RELATED-THING-1 id": relat_thing1_id,
                      #"RELATED-THING-2 IDs": relat_thing2_ids,
                      "Years": years
                      })

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')
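I have solved everything except looking up the "Thing Name" in the dataframe to get the "Thing ID". I will post a new, simpler question that focuses on that.
Here is the new Python, which solves my remaining issue 1 and half of remaining issue 2: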
# -*- coding: utf-8 -*-
# Importing the required libraries
#import xml.etree.ElementTree as Xet
import lxml.etree as Xet
import pandas as pd

# Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME",
             "ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
             "RELATED-THING-2 IDs", "Years"]
coll_rows = []

# Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()

for row in root:
    # Create thing lookup dataframe
    if (
        row.findtext('type') == "THING-TYPE-1" or
        row.findtext('type') == "THING-TYPE-2"
    ):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)

    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue

    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    item_id = row.findtext("ITEM-ID", "Missing item ID")
    attrib_1 = row.findtext("ATTRIB-1", "Missing attribute 1")
    p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name")
    attrib_2 = row.findtext("ATTRIB-2", "Missing attribute 2")
    p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name")

    # Join every related thing name into one comma-separated string
    relat_thing1_items = row.xpath("./RELATED-THING-1/result/row/name/text()")
    if len(relat_thing1_items) > 0:
        relat_thing1 = ', '.join(relat_thing1_items)
    else:
        relat_thing1 = ""
    #relat_thing1_id = look up relat_thing1 in thing_df as "Thing Name" \
    #    and return "Thing ID"

    relat_thing2_items = row.xpath("./RELATED-THING-2/result/row/name/text()")
    if len(relat_thing2_items) > 0:
        relat_thing2 = ', '.join(relat_thing2_items)
    else:
        relat_thing2 = ""
    #relat_thing2_id = look up every relat_thing2 in thing_df as "Thing Name" \
    #    and return all "Thing ID"

    # Join every year into one comma-separated string
    year_items = row.xpath("./RPTD-HIST-CODE/result/row/name/text()")
    if len(year_items) > 0:
        years = ', '.join(year_items)
    else:
        years = ""

    # The two RELATED-THING columns still hold names, not IDs
    coll_rows.append({"Collection item": name,
                      "ITEM-ID": item_id,
                      "ATTRIB-1": attrib_1,
                      "PERSON-TYPE-1-NAME": p1_name,
                      "ATTRIB-2": attrib_2,
                      "PERSON-TYPE-2-NAME": p2_name,
                      "RELATED-THING-1 id": relat_thing1,
                      "RELATED-THING-2 IDs": relat_thing2,
                      "Years": years
                      })

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')
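For the part that is still open, one option might be to read the XML in two passes: first collect every thing name and its ID into a dictionary, then resolve names to IDs while building the collection rows. The sketch below is not tested against the real file. The inline XML is a made-up sample whose element names are guessed from the paths used above, and thing_ids and ids_for are hypothetical names.

```python
import lxml.etree as Xet
import pandas as pd

# Hypothetical sample; the real sample.xml may be shaped differently.
SAMPLE = b"""<result>
  <row>
    <type>COLLECTION-ITEM</type><name>name of Item 1</name>
    <RELATED-THING-1><result><row><name>thing A</name></row></result></RELATED-THING-1>
    <RELATED-THING-2><result>
      <row><name>thing B</name></row>
      <row><name>thing C</name></row>
    </result></RELATED-THING-2>
  </row>
  <row><type>THING-TYPE-1</type><name>thing A</name><THING-ID>thing_000745</THING-ID></row>
  <row><type>THING-TYPE-2</type><name>thing B</name><THING-ID>thing_000783</THING-ID></row>
  <row><type>THING-TYPE-2</type><name>thing C</name><THING-ID>thing_000803</THING-ID></row>
</result>"""

root = Xet.fromstring(SAMPLE)

# Pass 1: map every thing name to its ID, wherever the thing rows appear.
thing_ids = {
    row.findtext("name"): row.findtext("THING-ID")
    for row in root
    if row.findtext("type") in ("THING-TYPE-1", "THING-TYPE-2")
}


def ids_for(row, tag):
    """Resolve all related thing names under `tag` to a comma-separated ID string."""
    names = row.xpath(f"./{tag}/result/row/name/text()")
    return ", ".join(thing_ids.get(str(n), f"Missing ID for {n}") for n in names)


# Pass 2: build the collection rows, using IDs instead of names.
coll_rows = []
for row in root:
    if row.findtext("type") != "COLLECTION-ITEM":
        continue
    coll_rows.append({
        "Collection item": row.findtext("name", "Missing name"),
        "RELATED-THING-1 id": ids_for(row, "RELATED-THING-1"),
        "RELATED-THING-2 IDs": ids_for(row, "RELATED-THING-2"),
    })

coll_df = pd.DataFrame(coll_rows)
print(coll_df.to_csv())
```

Doing the lookup in a separate first pass means it does not matter whether the thing rows come before or after the collection items in the file.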