解析 Python 中的深层嵌套 XML 文件
Parsing through a deep-nested XML File in Python
我正在查看类似于以下内容的 xml 文件:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
我已经用下面的代码解析了文件:
pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []
for event in root.iter('event'):
event_datetimeGMT = event.find('event_datetimeGMT').text
gamenumber = event.find('gamenumber').text
sporttype = event.find('sporttype').text
league = event.find('league').text
IsLive = event.find('IsLive').text
for participants in event.iter('participants'):
for participant in participants.iter('participant'):
p1_name = participant.find('participant_name').text
contestantnum = participant.find('contestantnum').text
rotnum = participant.find('rotnum').text
vhd = participant.find('visiting_home_draw').text
for periods in event.iter('periods'):
for period in periods.iter('period'):
period_number = period.find('period_number').text
desc = period.find('period_description').text
pdatetime = period.find('periodcutoff_datetimeGMT')
status = period.find('period_status').text
update = period.find('period_update').text
max = period.find('spread_maximum').text
mlmax = period.find('moneyline_maximum').text
tot_max = period.find('total_maximum').text
for moneyline in period.iter('moneyline'):
ml_vis = moneyline.find('moneyline_visiting').text
ml_home = moneyline.find('moneyline_home').text
但是,我希望通过类似于 2D table 的事件分隔节点(如在 pandas 数据帧中)。但是,完整的 xml 文件有多个 "event" 个子项,其中一些事件不共享上述相同的节点。我非常努力地能够获取每个事件节点并简单地创建一个带有标签的 2d table 和该值,其中标签充当列名,文本充当值。
到目前为止,我已经完成了上述操作以评估如何将这些信息放入字典中,然后将一些字典放入列表中,我可以使用 pandas 从中创建数据框,但这并没有奏效,因为所有尝试都要求我查找并替换文本以创建 dxcictionaries,而 python 在随后尝试创建数据框时对此反应不佳。我也使用了一个简单的:
for elt in tree.iter():
list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))
这在简单地提取每个标签和相应的文本方面效果很好,但我无法做任何事情,因为任何试图查找和替换文本以创建字典的尝试都没有用。
如有任何帮助,我们将不胜感激。
谢谢。
这是将 XML 放入 pandas 数据框的简单方法。这利用了很棒的 requests
库(如果你愿意,你可以切换到 urllib,以及 pypi 中可用的总是有用的 xmltodict
库。(注意:反向库也可用,知道作为 dicttoxml
)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
为了了解这开始是什么样子,这里是 CSV 的前几行:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
请注意,participants
和 periods
列仍然是它们的原生 Python 词典。您需要将它们从列列表中删除,或者进行一些额外的处理以使它们变平:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
我正在查看类似于以下内容的 xml 文件:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
我已经用下面的代码解析了文件:
pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []
for event in root.iter('event'):
event_datetimeGMT = event.find('event_datetimeGMT').text
gamenumber = event.find('gamenumber').text
sporttype = event.find('sporttype').text
league = event.find('league').text
IsLive = event.find('IsLive').text
for participants in event.iter('participants'):
for participant in participants.iter('participant'):
p1_name = participant.find('participant_name').text
contestantnum = participant.find('contestantnum').text
rotnum = participant.find('rotnum').text
vhd = participant.find('visiting_home_draw').text
for periods in event.iter('periods'):
for period in periods.iter('period'):
period_number = period.find('period_number').text
desc = period.find('period_description').text
pdatetime = period.find('periodcutoff_datetimeGMT')
status = period.find('period_status').text
update = period.find('period_update').text
max = period.find('spread_maximum').text
mlmax = period.find('moneyline_maximum').text
tot_max = period.find('total_maximum').text
for moneyline in period.iter('moneyline'):
ml_vis = moneyline.find('moneyline_visiting').text
ml_home = moneyline.find('moneyline_home').text
但是,我希望通过类似于 2D table 的事件分隔节点(如在 pandas 数据帧中)。但是,完整的 xml 文件有多个 "event" 个子项,其中一些事件不共享上述相同的节点。我非常努力地能够获取每个事件节点并简单地创建一个带有标签的 2d table 和该值,其中标签充当列名,文本充当值。
到目前为止,我已经完成了上述操作以评估如何将这些信息放入字典中,然后将一些字典放入列表中,我可以使用 pandas 从中创建数据框,但这并没有奏效,因为所有尝试都要求我查找并替换文本以创建 dxcictionaries,而 python 在随后尝试创建数据框时对此反应不佳。我也使用了一个简单的:
for elt in tree.iter():
list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))
这在简单地提取每个标签和相应的文本方面效果很好,但我无法做任何事情,因为任何试图查找和替换文本以创建字典的尝试都没有用。
如有任何帮助,我们将不胜感激。
谢谢。
这是将 XML 放入 pandas 数据框的简单方法。这利用了很棒的 requests
库(如果你愿意,你可以切换到 urllib,以及 pypi 中可用的总是有用的 xmltodict
库。(注意:反向库也可用,知道作为 dicttoxml
)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
为了了解这开始是什么样子,这里是 CSV 的前几行:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
请注意,participants
和 periods
列仍然是它们的原生 Python 词典。您需要将它们从列列表中删除,或者进行一些额外的处理以使它们变平:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball