Trying to parse through an xml file and put the text data into a dictionary using Python. keyerror: 0
I am parsing an XML file (shown below) with Python's ElementTree package.
<?xml version="1.0" encoding="Cp1252"?>
<CATALOG>
<CD>
<COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
<COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
<COLUMN NAME='COUNTRY'>USA</COLUMN>
<COLUMN NAME='COMPANY'>Columbia</COLUMN>
<COLUMN NAME='PRICE'>10.90</COLUMN>
<COLUMN NAME='YEAR'>1985</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Hide your heart</COLUMN>
<COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
<COLUMN NAME='COUNTRY'>UK</COLUMN>
<COLUMN NAME='COMPANY'>CBS Records</COLUMN>
<COLUMN NAME='PRICE'>9.90</COLUMN>
<COLUMN NAME='YEAR'>1988</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
<COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
<COLUMN NAME='COUNTRY'>USA</COLUMN>
<COLUMN NAME='COMPANY'>RCA</COLUMN>
<COLUMN NAME='PRICE'>9.90</COLUMN>
<COLUMN NAME='YEAR'>1982</COLUMN>
</CD>
</CATALOG>
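Now I want to put the full text of each attribute (i.e. TITLE, ARTIST, etc.) into a dictionary, and then write that text to a CSV file, one row per CD. Below is my Python program.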
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import fromstring
import csv
tree = ET.parse('sample.xml')
root = tree.getroot()
orgdata = {}
orglist = []
csv_columns = ['TITLE','ARTIST','COUNTRY','COMPANY','PRICE','YEAR']
count = 0
for child in root:
    for sub in child:
        if sub.attrib.get('NAME') == 'TITLE':
            orgdata['TITLE'] = sub.text
        if sub.attrib.get('NAME') == 'ARTIST':
            orgdata['ARTIST'] = sub.text
        if sub.attrib.get('NAME') == 'COUNTRY':
            orgdata['COUNTRY'] = sub.text
        if sub.attrib.get('NAME') == 'COMPANY':
            orgdata['COMPANY'] = sub.text
        if sub.attrib.get('NAME') == 'PRICE':
            orgdata['PRICE'] = sub.text
        if sub.attrib.get('NAME') == 'YEAR':
            orgdata['YEAR'] = sub.text
    tocsv = orgdata
    orglist.append(orgdata)
k = tocsv[0].keys()
with open('orgfile.txt','w+') as csvfile:
    dic = csv.DictWriter(csvfile,k,delimiter='|',extrasaction='ignore')
    dic.writeheader()
    dic.writerows(tocsv)
This code raises KeyError: 0:
$ python sample.py
Traceback (most recent call last):
  File "sample.py", line 30, in <module>
    k = tocsv[0].keys()
KeyError: 0
Is there a way to fix this and get the data into the CSV file without the duplicates?
Maybe you could use findall, which simplifies things a little:
In [20]: x = """
...: <CATALOG>
...: <CD>
...: <COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
...: <COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
...: <COLUMN NAME='COUNTRY'>USA</COLUMN>
...: <COLUMN NAME='COMPANY'>Columbia</COLUMN>
...: <COLUMN NAME='PRICE'>10.90</COLUMN>
...: <COLUMN NAME='YEAR'>1985</COLUMN>
...: </CD>
...: <CD>
...: <COLUMN NAME='TITLE'>Hide your heart</COLUMN>
...: <COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
...: <COLUMN NAME='COUNTRY'>UK</COLUMN>
...: <COLUMN NAME='COMPANY'>CBS Records</COLUMN>
...: <COLUMN NAME='PRICE'>9.90</COLUMN>
...: <COLUMN NAME='YEAR'>1988</COLUMN>
...: </CD>
...: <CD>
...: <COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
...: <COLUMN NAME='ARTIST'>Dolly Parton</COLUMN>
...: <COLUMN NAME='COUNTRY'>USA</COLUMN>
...: <COLUMN NAME='COMPANY'>RCA</COLUMN>
...: <COLUMN NAME='PRICE'>9.90</COLUMN>
...: <COLUMN NAME='YEAR'>1982</COLUMN>
...: </CD>
...: </CATALOG>"""
In [21]:
In [21]: xdata = fromstring(x)
In [22]: results = []
In [23]: for cd in xdata.findall('.//CD'):
    ...:     each_result = {}
    ...:     for each in cd.findall('.//COLUMN'):
    ...:         each_result[each.attrib.get('NAME')] = each.text
    ...:     results.append(each_result)
This results in:
In [24]: results
Out[24]:
[{'TITLE': 'Empire Burlesque',
'ARTIST': 'Bob Dylan',
'COUNTRY': 'USA',
'COMPANY': 'Columbia',
'PRICE': '10.90',
'YEAR': '1985'},
{'TITLE': 'Hide your heart',
'ARTIST': 'Bonnie Tyler',
'COUNTRY': 'UK',
'COMPANY': 'CBS Records',
'PRICE': '9.90',
'YEAR': '1988'},
{'TITLE': 'Greatest Hits',
'ARTIST': 'Dolly Parton',
'COUNTRY': 'USA',
'COMPANY': 'RCA',
'PRICE': '9.90',
'YEAR': '1982'}]
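The same idea can be packed into a small reusable function. This is only a sketch: the name `parse_catalog` and the shortened sample string are illustrative and not part of the original answer. It uses `findall('CD')`, which matches direct children only, whereas `'.//CD'` above matches descendants at any depth; for this flat catalog both return the same elements.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same shape as the catalog above (illustrative data).
SAMPLE = """<CATALOG>
<CD>
<COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
<COLUMN NAME='YEAR'>1985</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Greatest Hits</COLUMN>
<COLUMN NAME='YEAR'>1982</COLUMN>
</CD>
</CATALOG>"""


def parse_catalog(xml_text):
    """Return one dict per <CD>, mapping each COLUMN's NAME to its text."""
    root = ET.fromstring(xml_text)
    return [
        # Element.get() reads an attribute and returns None if it is missing.
        {col.get("NAME"): col.text for col in cd.findall("COLUMN")}
        for cd in root.findall("CD")
    ]


if __name__ == "__main__":
    for row in parse_catalog(SAMPLE):
        print(row)
```

Because the dict is built from whatever NAME values are present, no per-column `if` is needed, and a new column in the XML shows up in the result without code changes.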
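First, I think you meant orglist[0].keys() rather than tocsv[0].keys(). tocsv is a single dict, so tocsv[0] looks up the key 0 and raises KeyError: 0. For the same reason, dic.writerows(tocsv) should receive orglist. This will fix your error.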
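A minimal sketch of what the corrected csv-based version might look like, keeping the loop structure from the question. The function name `catalog_to_csv`, the shortened sample, and the use of `io.StringIO` in place of `orgfile.txt` are assumptions made so the example is self-contained. It also creates a fresh dict for every CD: appending one shared dict repeatedly would leave every list entry pointing at the same object, which is a likely source of the repeated rows.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample with the same shape as the catalog above (illustrative data).
SAMPLE = """<CATALOG>
<CD>
<COLUMN NAME='TITLE'>Empire Burlesque</COLUMN>
<COLUMN NAME='ARTIST'>Bob Dylan</COLUMN>
</CD>
<CD>
<COLUMN NAME='TITLE'>Hide your heart</COLUMN>
<COLUMN NAME='ARTIST'>Bonnie Tyler</COLUMN>
</CD>
</CATALOG>"""


def catalog_to_csv(xml_text, delimiter="|"):
    """Return the catalog as delimited text, one row per <CD>."""
    root = ET.fromstring(xml_text)
    orglist = []
    for child in root:
        orgdata = {}  # new dict per CD, so rows do not share one object
        for sub in child:
            name = sub.attrib.get("NAME")
            if name is not None:
                orgdata[name] = sub.text
        orglist.append(orgdata)
    if not orglist:
        return ""
    # Take the header from the list of rows, not from a single dict.
    fieldnames = list(orglist[0].keys())
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames,
        delimiter=delimiter,
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(orglist)  # writerows expects an iterable of dicts
    return buffer.getvalue()


if __name__ == "__main__":
    print(catalog_to_csv(SAMPLE), end="")
```

To write to a file as in the question, the buffer could be replaced by `open('orgfile.txt', 'w', newline='')`, which is how the csv module documentation recommends opening files for writing.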
As for your second question:
Is there a way to fix this and get the data into the CSV file without the duplicates?
The answer is yes. You can use pandas.DataFrame
to do it in just three lines of code, like so:
>>> import pandas as pd
>>> df = pd.DataFrame(orglist)
>>> df.drop_duplicates(inplace=True)
>>> print(df)
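If pandas is not available, duplicates can also be dropped with the standard library alone. The sketch below is an alternative to `drop_duplicates`, not what the answer itself uses; the helper name `drop_duplicate_rows` is hypothetical, and it assumes every value in a row is hashable (strings or None, as produced by `sub.text`).

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        # Dicts are not hashable, so use a sorted tuple of items as the key.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    rows = [
        {"TITLE": "Greatest Hits", "YEAR": "1982"},
        {"TITLE": "Greatest Hits", "YEAR": "1982"},
        {"TITLE": "Empire Burlesque", "YEAR": "1985"},
    ]
    for row in drop_duplicate_rows(rows):
        print(row)
```

Sorting the items makes the key independent of the order in which the columns were inserted, so two rows with the same content compare equal even if their keys were added in a different order.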
Edit
So your code should look like this:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('sample.xml')
root = tree.getroot()
orglist = []
for child in root:
    # Start a new dict for every CD so the rows do not share one object
    orgdata = {}
    for sub in child:
        if sub.attrib.get('NAME') == 'TITLE':
            orgdata['TITLE'] = sub.text
        if sub.attrib.get('NAME') == 'ARTIST':
            orgdata['ARTIST'] = sub.text
        if sub.attrib.get('NAME') == 'COUNTRY':
            orgdata['COUNTRY'] = sub.text
        if sub.attrib.get('NAME') == 'COMPANY':
            orgdata['COMPANY'] = sub.text
        if sub.attrib.get('NAME') == 'PRICE':
            orgdata['PRICE'] = sub.text
        if sub.attrib.get('NAME') == 'YEAR':
            orgdata['YEAR'] = sub.text
    orglist.append(orgdata)
# One row per CD; drop any rows that are exact duplicates
df = pd.DataFrame(orglist)
df.drop_duplicates(inplace=True)
print(df)
This prints:
         ARTIST      COMPANY COUNTRY  PRICE             TITLE  YEAR
0     Bob Dylan     Columbia     USA  10.90  Empire Burlesque  1985
1  Bonnie Tyler  CBS Records      UK   9.90   Hide your heart  1988
2  Dolly Parton          RCA     USA   9.90     Greatest Hits  1982