Parsing deeply nested XML into dataframe with python - struggling with deeper elements

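I'm trying to parse a fairly deeply nested XML file. I've spent the last few hours looking for a solution without any luck. I'm not sure whether the problem is related to namespaces or whether I need to do the lookups inside a loop.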

I'm able to extract the higher-level elements, but not the more deeply nested ones. I want to export part_number, manufacturer_name, name, product and retail to a df.

A sample of the XML is here (not all entries are fully consistent; some fields are missing):

<?xml version="1.0" encoding="UTF-8"?><merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd"><header><merchantId>35386</merchantId><merchantName>Rock Bottom Golf</merchantName><createdOn>10/13/2021 14:01:49</createdOn></header>
<product product_id='15' name='Champ Golf- Max Pro Spike Wrench' sku_number='19CHPSPWRCH1111111111101' manufacturer_name='Champ Golf' part_number='19CHPSPWRCH1111111111101'><category><primary>Sporting Goods</primary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.15&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fother%2Fchamp-golf-max-pro-spike-wrench%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19CHPSPWRCH1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19chpspwrch1111111111101.jpg</productImage></URL><description><short>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</short><long>A convenient and easy to use tool. No more struggling with your spikes. Features: Comfortable contoured soft touch dual density handle Three position ratchet for insertion, removal or lock in place Three bits to fit any spike, all will fit in drills Stand</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>9.99</retail></price><brand>Champ Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00036504884013</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.15&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='21' name='Stinger Tees- 3&quot; Stinger Pro XL Competition Camo Mid Pack Poly Bag [125 Count]' sku_number='19STGTEEMID3CO1111111101' manufacturer_name='Stinger Tees' part_number='19STGTEEMID3CO1111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.21&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Ftees%2Fstinger-tees-3-stinger-pro-xl-competition-camo-mid-pack-poly-bag-125-count%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19STGTEEMID3CO1111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/3%20tees%20125%20count.jpg</productImage></URL><description><short>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</short><long>Features: Resealable package Less resistance due to a smaller tee head Built to withstand the strongest swings High-quality 120 Tees</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>7.99</retail></price><brand>Stinger Tees</brand><shipping><availability>in-stock</availability></shipping><upc>00853190005047</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.21&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='23' name='Vegas Golf- Original Game' sku_number='19VEGORIGIN1111111111101' manufacturer_name='Vegas Golf' part_number='19VEGORIGIN1111111111101'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.23&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fother%2Fvegas-golf-original-game%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19VEGORIGIN1111111111101</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19vegorigin1111111111101.jpg</productImage></URL><description><short>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</short><long>For a limited time only, you&apos;ll get 2 bonus chips with your purchase for a total of 10 game chips! Vegas Golf: the ultimate on-the-course gambling game. Vegas Golf consists of real casino style chips, the object is to avoid the negative and obtain the pos</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>14.99</retail></price><brand>Vegas Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00689076007030</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.23&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>
<product product_id='28' name='Ray Cook Golf- 12&apos; Compact Cup Ball Retriever' sku_number='19RAYBALRET1111111111201' manufacturer_name='Ray Cook Golf' part_number='19RAYBALRET1111111111201'><category><primary>Sporting Goods</primary><secondary>Outdoor Recreation~~Golf</secondary></category><URL><product>https://click.linksynergy.com/link?id=83wh4zNK2Zo&amp;offerid=301124.28&amp;type=15&amp;murl=http%3A%2F%2Fwww.rockbottomgolf.com%2Faccessories%2Fball-retrievers%2Fray-cook-golf-12-compact-cup-ball-retriever%2F%3Futm_source%3Drakuten%26utm_medium%3Dcse%26utm_term%3D19RAYBALRET1111111111201</product><productImage>http://d3d71ba2asa5oz.cloudfront.net/40000065/images/19raybalret12.jpg</productImage></URL><description><short>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</short><long>The Ray Cook Golf Ball Retriever extends up to 12 feet and is the perfect companion for every golf bag. Features: Durable construction Telescoping shaft design makes the retriever easy to carry</long></description><discount currency='USD'><type>amount</type></discount><price currency='USD'><retail>19.99</retail></price><brand>Ray Cook Golf</brand><shipping><availability>in-stock</availability></shipping><upc>00840254178410</upc><pixel>https://ad.linksynergy.com/fs-bin/show?id=83wh4zNK2Zo&amp;bids=301124.28&amp;type=15&amp;subid=0</pixel><modification>U</modification></product>

I wrote the Python code below, which extracts part_number, manufacturer_name and name, but the other two remain elusive.

My code:

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse(r"file.xml")
xroot = xtree.getroot() 

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []

for node in xroot: 
    part_number = node.attrib.get("part_number")
    manufacturer_name = node.attrib.get("manufacturer_name")
    name = node.attrib.get("name")  
    product = node.findall("product") if node is not None else None
    retail = node.findall("retail") if node is not None else None

    rows.append({"part_number": part_number, "manufacturer": manufacturer_name, "name": name, "retail": retail, "product": product,})


out_df = pd.DataFrame(rows, columns = df_cols)

out_df.head()

My current output (retail and product come out empty):

                part_number   manufacturer  ... retail product
0                      None           None  ...     []      []
1  19CHPSPWRCH1111111111101     Champ Golf  ...     []      []
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     []      []
3  19VEGORIGIN1111111111101     Vegas Golf  ...     []      []
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     []      []

My desired output (URLs shortened for readability, but the product column should hold the full URL):

                part_number   manufacturer  ... retail product
0                      None           None  ...     9.99     https://click.linksynergy.com/link?id=83...
1  19CHPSPWRCH1111111111101     Champ Golf  ...     7.99      https://click.linksynergy.com/link?id=83...
2  19STGTEEMID3CO1111111101   Stinger Tees  ...     14.99      https://click.linksynergy.com/link?id=83...
3  19VEGORIGIN1111111111101     Vegas Golf  ...     19.99      https://click.linksynergy.com/link?id=83...
4  19RAYBALRET1111111111201  Ray Cook Golf  ...     6.99      https://click.linksynergy.com/link?id=83...

Any help would be greatly appreciated!

Assuming the XML structure is constant and the elements/attributes are retrieved by the XPath expression in the same order:

from lxml import etree
import pandas as pd

df_cols = ["part_number", "manufacturer", "name", "retail", "product"]
rows = []
tree = etree.parse('/home/luis/tmp/tmp.xml')
root = tree.getroot()
steps = tree.xpath('//product/attribute::*[name()="name" or name()="part_number" or name()="manufacturer_name"] | //product/URL/product/text() | //product/price/retail/text()')
# The union is returned in document order, so each product yields:
# name, manufacturer_name, part_number, URL/product text, price/retail text.
i=0
d=dict()
for s in steps:

    if i == 0:
        d[df_cols[2]]=s  # name
    if i == 1:
        d[df_cols[1]]=s  # manufacturer
    if i == 2:
        d[df_cols[0]]=s  # part_number
    if i == 3:
        d[df_cols[4]]=s  # product (URL)
    if i == 4:
        d[df_cols[3]]=s  # retail
        rows.append(d)
        i=0
        d=dict()
        continue
    i+=1


out_df = pd.DataFrame(rows, columns = df_cols)

print(out_df.head())

Result:

                part_number   manufacturer                                               name  retail                                            product
0  19CHPSPWRCH1111111111101     Champ Golf                   Champ Golf- Max Pro Spike Wrench    9.99  https://click.linksynergy.com/link?id=83wh4zNK...
1  19STGTEEMID3CO1111111101   Stinger Tees  Stinger Tees- 3" Stinger Pro XL Competition Ca...    7.99  https://click.linksynergy.com/link?id=83wh4zNK...
2  19VEGORIGIN1111111111101     Vegas Golf                          Vegas Golf- Original Game   14.99  https://click.linksynergy.com/link?id=83wh4zNK...
3  19RAYBALRET1111111111201  Ray Cook Golf      Ray Cook Golf- 12' Compact Cup Ball Retriever   19.99  https://click.linksynergy.com/link?id=83wh4zNK...
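
The positional approach above depends on every product carrying all five values. Since the question mentions that some fields are missing, a per-product lookup may be safer. The following is a minimal sketch using only the standard library; the inline `SAMPLE` string and the `extract_rows` helper are hypothetical names made up for illustration, and the second product deliberately has no price element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated sample; the second product has no <price>.
SAMPLE = """<merchandiser>
  <header><merchantId>35386</merchantId></header>
  <product name="Item A" manufacturer_name="Maker A" part_number="PN-A">
    <URL><product>https://example.com/a</product></URL>
    <price currency="USD"><retail>9.99</retail></price>
  </product>
  <product name="Item B" manufacturer_name="Maker B" part_number="PN-B">
    <URL><product>https://example.com/b</product></URL>
  </product>
</merchandiser>"""


def extract_rows(root):
    """Build one dict per <product>, looking each value up by name or path."""
    rows = []
    # A bare tag name matches direct children only, so <header> is skipped.
    for p in root.findall("product"):
        rows.append({
            "part_number": p.get("part_number"),
            "manufacturer": p.get("manufacturer_name"),
            "name": p.get("name"),
            # findtext returns None when the path does not match.
            "retail": p.findtext("price/retail"),
            "product": p.findtext("URL/product"),
        })
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows, columns=df_cols)` should then give one row per product, with `None` wherever a value is absent.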

See below

import requests
import xml.etree.ElementTree as ET
import pandas as pd

r = requests.get('https://raw.githubusercontent.com/dgs2021/golfdeals/main/35386_3864840_mp_delta.xml')
attrb_fields =  {'manufacturer_name': 'manufacturer','name':'name','part_number':'part_number'}
sub_elements = {'retail':'retail','product':'product'}

root = ET.fromstring(r.content)

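data = []
for p in root.findall('product'):
  # Attributes live on the <product> element itself; default to 'NA' when absent
  entry = {v:p.attrib.get(k,'NA') for k,v in attrb_fields.items()}
  for k,v in sub_elements.items():
    # './/' searches all descendants, so the nested <retail> and URL/<product> are found
    e = p.find(f'.//{v}')
    entry[k] = e.text if e is not None else 'NA'
  data.append(entry)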
columns = list(attrb_fields.values()) + list(sub_elements.values())
df = pd.DataFrame(data,columns= columns)
print(df)

Output

          manufacturer  ...                                            product
0           Champ Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
1         Stinger Tees  ...  https://click.linksynergy.com/link?id=83wh4zNK...
2           Vegas Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
3        Ray Cook Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4     Rock Bottom Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
...                ...  ...                                                ...
4100     Callaway Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4101        Cobra Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4102      Odyssey Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4103   TaylorMade Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...
4104     Titleist Golf  ...  https://click.linksynergy.com/link?id=83wh4zNK...

[4105 rows x 5 columns]
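
As for why the code in the question returned empty lists: in ElementTree, `findall("retail")` with a bare tag name matches only direct children of the node, and `<retail>` sits one level deeper, inside `<price>`. Iterating over the root also visits `<header>`, which would explain the first row of `None` values. A small sketch of the difference, using a hypothetical one-product sample with the same nesting as the feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated sample with the same nesting as the feed.
doc = ET.fromstring(
    "<merchandiser>"
    "<header><merchantName>Rock Bottom Golf</merchantName></header>"
    "<product part_number='PN-1'>"
    "<URL><product>https://example.com/item</product></URL>"
    "<price currency='USD'><retail>9.99</retail></price>"
    "</product>"
    "</merchandiser>"
)

# Iterating the root yields every direct child, <header> included.
child_tags = [child.tag for child in doc]
print(child_tags)

node = doc.find("product")

# A bare tag name only matches direct children, so nothing is found.
direct = node.findall("retail")
print(direct)

# A full relative path, or a descendant search with .//, reaches the element.
by_path = node.find("price/retail").text
by_descendant = node.find(".//retail").text
print(by_path, by_descendant)
```

The explicit path (`price/retail`) is the stricter of the two; `.//retail` is shorter but returns the first descendant with that tag wherever it appears.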