Trying to parse an XML file and put the text data into a dictionary using Python. KeyError: 0

I am parsing an XML file (shown below) with Python's ElementTree package.

<?xml version="1.0" encoding="Cp1252"?>
<CATALOG>
  <CD>
    <COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
    <COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
    <COLUMN NAME='COUNTRY'>USA</COLUMN>
    <COLUMN NAME='COMPANY'>Columbia</COLUMN>
    <COLUMN NAME='PRICE'>10.90</COLUMN>
    <COLUMN NAME='YEAR'>1985</COLUMN>
  </CD>
  <CD>
    <COLUMN NAME='TITLE'>Hide your heart</COLUMN>
    <COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
    <COLUMN NAME='COUNTRY'>UK</COLUMN>
    <COLUMN NAME='COMPANY'>CBS Records</COLUMN>
    <COLUMN NAME='PRICE'>9.90</COLUMN>
    <COLUMN NAME='YEAR'>1988</COLUMN>
  </CD>
  <CD>
    <COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
    <COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
    <COLUMN NAME='COUNTRY'>USA</COLUMN>
    <COLUMN NAME='COMPANY'>RCA</COLUMN>
    <COLUMN NAME='PRICE'>9.90</COLUMN>
    <COLUMN NAME='YEAR'>1982</COLUMN>
  </CD>
</CATALOG>

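Now I want to put the full text of each attribute (i.e. TITLE, ARTIST, etc.) into a dictionary and then write that text to a CSV file, one record per line. My Python program is below.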

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import fromstring
import csv

tree = ET.parse('sample.xml')
root = tree.getroot()

orgdata = {}
orglist = []

csv_columns = ['TITLE','ARTIST','COUNTRY','COMPANY','PRICE','YEAR']

count = 0
for child in root:
    for sub in child:
        if sub.attrib.get('NAME') == 'TITLE':
            orgdata['TITLE'] = sub.text
        if sub.attrib.get('NAME') == 'ARTIST':
            orgdata['ARTIST'] = sub.text
        if sub.attrib.get('NAME') == 'COUNTRY':
            orgdata['COUNTRY'] = sub.text
        if sub.attrib.get('NAME') == 'COMPANY':
            orgdata['COMPANY'] = sub.text
        if sub.attrib.get('NAME') == 'PRICE':
            orgdata['PRICE'] = sub.text
        if sub.attrib.get('NAME') == 'YEAR':
            orgdata['YEAR'] = sub.text
        tocsv = orgdata
orglist.append(orgdata)
k = tocsv[0].keys()
with open('orgfile.txt','w+') as csvfile:
    dic = csv.DictWriter(csvfile,k,delimiter='|',extrasaction='ignore')
    dic.writeheader()
    dic.writerows(tocsv)

This code raises KeyError: 0:

$ python sample.py
Traceback (most recent call last):
  File "sample.py", line 30, in <module>
    k = tocsv[0].keys()
KeyError: 0

Is there a way to fix this and get the data into the CSV file without the duplicates?

Maybe you can use findall:

Simplified a bit:
In [20]: x = """
    ...: <CATALOG>
    ...:   <CD>
    ...:     <COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
    ...:     <COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
    ...:     <COLUMN NAME='COUNTRY'>USA</COLUMN>
    ...:     <COLUMN NAME='COMPANY'>Columbia</COLUMN>
    ...:     <COLUMN NAME='PRICE'>10.90</COLUMN>
    ...:     <COLUMN NAME='YEAR'>1985</COLUMN>
    ...:   </CD>
    ...:   <CD>
    ...:     <COLUMN NAME='TITLE'>Hide your heart</COLUMN>
    ...:     <COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
    ...:     <COLUMN NAME='COUNTRY'>UK</COLUMN>
    ...:     <COLUMN NAME='COMPANY'>CBS Records</COLUMN>
    ...:     <COLUMN NAME='PRICE'>9.90</COLUMN>
    ...:     <COLUMN NAME='YEAR'>1988</COLUMN>
    ...:   </CD>
    ...:   <CD>
    ...:     <COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
    ...:     <COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
    ...:     <COLUMN NAME='COUNTRY'>USA</COLUMN>
    ...:     <COLUMN NAME='COMPANY'>RCA</COLUMN>
    ...:     <COLUMN NAME='PRICE'>9.90</COLUMN>
    ...:     <COLUMN NAME='YEAR'>1982</COLUMN>
    ...:   </CD>
    ...: </CATALOG>"""

In [21]: from xml.etree.ElementTree import fromstring
    ...: xdata = fromstring(x)

In [22]: results = []

In [23]: for cd in xdata.findall('.//CD'):
    ...:     each_result = {}
    ...:     for each in cd.findall('.//COLUMN'):
    ...:         each_result[each.attrib.get('NAME')] = each.text
    ...:     results.append(each_result)

This results in:

In [24]: results
Out[24]:
[{'TITLE': 'Empire Burlesque',
  'ARTIST': 'Bob Dylan',
  'COUNTRY': 'USA',
  'COMPANY': 'Columbia',
  'PRICE': '10.90',
  'YEAR': '1985'},
 {'TITLE': 'Hide your heart',
  'ARTIST': 'Bonnie Tyler',
  'COUNTRY': 'UK',
  'COMPANY': 'CBS Records',
  'PRICE': '9.90',
  'YEAR': '1988'},
 {'TITLE': 'Greatest Hits',
  'ARTIST': 'Dolly Parton',
  'COUNTRY': 'USA',
  'COMPANY': 'RCA',
  'PRICE': '9.90',
  'YEAR': '1982'}]
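
To get from that list of dictionaries to the pipe-delimited file the question asks for, the rows can be handed to csv.DictWriter. The following is only a sketch under a few assumptions: the helper names cds_to_rows and write_rows are made up for illustration, the sample XML is shortened to two columns, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample; the real file has six COLUMN elements per CD.
SAMPLE = """<CATALOG>
  <CD>
    <COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
    <COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
  </CD>
  <CD>
    <COLUMN NAME='TITLE'>Hide your heart</COLUMN>
    <COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
  </CD>
</CATALOG>"""


def cds_to_rows(xml_text):
    """Return one dict per <CD>, keyed by each COLUMN's NAME attribute."""
    root = ET.fromstring(xml_text)
    rows = []
    for cd in root.findall('.//CD'):
        rows.append({col.get('NAME'): col.text for col in cd.findall('COLUMN')})
    return rows


def write_rows(rows, fileobj, fieldnames):
    """Write the rows as pipe-delimited text with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames, delimiter='|',
                            extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)


rows = cds_to_rows(SAMPLE)
buffer = io.StringIO()  # stand-in for open('orgfile.txt', 'w', newline='')
write_rows(rows, buffer, ['TITLE', 'ARTIST'])
print(buffer.getvalue())
```

For a real file, replace the buffer with open('orgfile.txt', 'w', newline=''), since the csv module handles line endings itself.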

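First, I think you meant orglist[0].keys() rather than tocsv[0].keys(). tocsv is a dictionary, so tocsv[0] looks for a key equal to 0, which does not exist; orglist is the list of dictionaries. The same goes for dic.writerows(tocsv), which should receive orglist. That will fix your error.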
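
A small illustration of where the KeyError comes from, using made-up variable names (record, records) rather than the ones in the question: indexing a dict with 0 is a key lookup, while indexing a list with 0 returns its first element.

```python
# One row of data, and a list holding that row.
record = {'TITLE': 'Greatest Hits', 'ARTIST': 'Dolly Parton'}
records = [record]

missing_key = None
try:
    # A dict is looked up by key, so this searches for a key equal to 0.
    record[0].keys()
except KeyError as exc:
    missing_key = exc.args[0]

# A list is indexed by position, so this returns the first dict.
header = list(records[0].keys())

print(missing_key)
print(header)
```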

As for your second question:

Is there a way to fix this and get the data into the CSV file without the duplicates?

The answer is yes. With pandas.DataFrame you can do it in just three lines of code, as follows:

>>> import pandas as pd

>>> df = pd.DataFrame(orglist)
>>> df.drop_duplicates(inplace=True)
>>> print(df)

Edit

So your code should look like this:

import xml.etree.ElementTree as ET
import pandas as pd


tree = ET.parse('sample.xml')
root = tree.getroot()

orglist = []
for child in root:
    # Start a fresh dict for every <CD> so rows do not overwrite each other
    orgdata = {}
    for sub in child:
        if sub.attrib.get('NAME') == 'TITLE':
            orgdata['TITLE'] = sub.text
        if sub.attrib.get('NAME') == 'ARTIST':
            orgdata['ARTIST'] = sub.text
        if sub.attrib.get('NAME') == 'COUNTRY':
            orgdata['COUNTRY'] = sub.text
        if sub.attrib.get('NAME') == 'COMPANY':
            orgdata['COMPANY'] = sub.text
        if sub.attrib.get('NAME') == 'PRICE':
            orgdata['PRICE'] = sub.text
        if sub.attrib.get('NAME') == 'YEAR':
            orgdata['YEAR'] = sub.text
    # Append once per <CD>, inside the outer loop
    orglist.append(orgdata)

df = pd.DataFrame(orglist)
df.drop_duplicates(inplace=True)
print(df)

This prints:

         ARTIST      COMPANY COUNTRY  PRICE             TITLE  YEAR
0     Bob Dylan     Columbia     USA  10.90  Empire Burlesque  1985
1  Bonnie Tyler  CBS Records      UK   9.90   Hide your heart  1988
2  Dolly Parton          RCA     USA   9.90     Greatest Hits  1982
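
If pandas is not available, duplicates can also be dropped with the standard library alone. This is a rough sketch, not a replacement for drop_duplicates: the helper name drop_duplicate_rows and the shortened sample rows are assumptions for illustration, and two rows count as duplicates only when every listed column matches.

```python
def drop_duplicate_rows(rows, fieldnames):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        # Dicts are not hashable, so compare rows as tuples of their values.
        key = tuple(row.get(name) for name in fieldnames)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows in the same shape as orglist, with one repeat.
orglist = [
    {'TITLE': 'Greatest Hits', 'ARTIST': 'Dolly Parton'},
    {'TITLE': 'Greatest Hits', 'ARTIST': 'Dolly Parton'},
    {'TITLE': 'Hide your heart', 'ARTIST': 'Bonnie Tyler'},
]
columns = ['TITLE', 'ARTIST']

unique_rows = drop_duplicate_rows(orglist, columns)
for row in unique_rows:
    print('|'.join(row[name] for name in columns))
```

The resulting list can be passed to csv.DictWriter.writerows. With pandas, df.to_csv('orgfile.txt', sep='|', index=False) should write the de-duplicated frame directly.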