Parsing through a deep-nested XML File in Python

I am looking at an XML file similar to the following:

<pinnacle_line_feed>
  <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
  <lastContest>28962804</lastContest>
  <lastGame>162995589</lastGame>
  <events>
    <event>
      <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
      <IsLive>No</IsLive>
      <participants>
        <participant>
          <participant_name>Kjetil Jansrud (NOR)</participant_name>
          <contestantnum>2001</contestantnum>
          <rotnum>2001</rotnum>
          <visiting_home_draw>Visiting</visiting_home_draw>
        </participant>
        <participant>
          <participant_name>The Field</participant_name>
          <contestantnum>2002</contestantnum>
          <rotnum>2002</rotnum>
          <visiting_home_draw>Home</visiting_home_draw>
        </participant>
      </participants>
      <periods>
        <period>
          <period_number>0</period_number>
          <period_description>Matchups</period_description>
          <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
          <period_status>I</period_status>
          <period_update>open</period_update>
          <spread_maximum>200</spread_maximum>
          <moneyline_maximum>100</moneyline_maximum>
          <total_maximum>200</total_maximum>
          <moneyline>
            <moneyline_visiting>116</moneyline_visiting>
            <moneyline_home>-136</moneyline_home>
          </moneyline>
        </period>
      </periods>
      <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    </event>
  </events>
</pinnacle_line_feed>

I have parsed the file with the code below:

import urllib
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'

tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum  = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT').text
            status = period.find('period_status').text
            update = period.find('period_update').text
            spread_max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text

However, I would like the nodes separated by event, in something like a 2D table (as in a pandas dataframe). The full XML file has multiple "event" children, though, and some events do not share all of the nodes shown above. I have struggled to take each event node and simply build a 2D table where the tags serve as column names and the text as values.
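One way to get that 2D shape is to collect each event's leaf children into a dict and hand the list of dicts to pandas; events that lack a given tag simply produce NaN in that column. A minimal sketch on a trimmed copy of the sample above (names and values invented for illustration), assuming all values can be treated as text:

```python
import xml.etree.ElementTree as ET
import pandas as pd

xml_text = """\
<pinnacle_line_feed>
  <events>
    <event>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
    </event>
    <event>
      <gamenumber>422739999</gamenumber>
      <sporttype>Basketball</sporttype>
    </event>
  </events>
</pinnacle_line_feed>"""

root = ET.fromstring(xml_text)

rows = []
for event in root.iter('event'):
    # Each leaf child becomes a column: tag -> column name, text -> value.
    row = {child.tag: child.text for child in event if len(child) == 0}
    rows.append(row)

df = pd.DataFrame(rows)  # events missing a tag get NaN in that column
```

Because pandas aligns the dicts by key, events with differing nodes still land in one table without any find-and-replace step.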

So far I have written the code above while working out how to get this information into dictionaries, then collect those dictionaries in a list from which I could build a dataframe with pandas. That has not worked: every attempt required find-and-replacing text to create the dictionaries, and Python did not handle that well when I subsequently tried to build the dataframe. I have also used a simple:

for elt in tree.iter():
    list.append("'%s': '%s'" % (elt.tag, (elt.text or '').strip()))

This worked well for simply pulling out each tag and its corresponding text, but I could not do anything with the output, since every attempt to find-and-replace that text into dictionaries failed.
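Rather than formatting tag/text pairs into strings and trying to parse them back, the dictionaries can be built directly. A sketch of that idea on a tiny invented fragment:

```python
import xml.etree.ElementTree as ET

xml_text = ('<event><league>DH 145</league>'
            '<IsLive>No</IsLive></event>')
event = ET.fromstring(xml_text)

record = {}
for elt in event.iter():
    # Skip container elements whose text is None or whitespace-only.
    if elt.text and elt.text.strip():
        record[elt.tag] = elt.text.strip()
```

`record` is now a real dict (`{'league': 'DH 145', 'IsLive': 'No'}`), so a list of such records can go straight into `pandas.DataFrame` with no string surgery.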

Any help is greatly appreciated.

Thanks.

Here is an easy way to get the XML into a pandas dataframe. This leverages the excellent requests library (you can switch to urllib if you prefer) and the always-useful xmltodict library, available on PyPI. (Note: the reverse library is also available, known as dicttoxml.)

import json
import pandas
import requests
import xmltodict

web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')

# Make that unwieldy XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)

# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
#   visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))

# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)

To get a sense of what this looks like, here are the first few lines of the CSV output:

>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo)  # Output the df to a CSV file
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball

Note that the participants and periods columns are still native Python dictionaries. You will either need to remove them from the list of columns, or do some additional processing to flatten them:

# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
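Alternatively, if you want to flatten rather than drop those nested columns, current pandas releases ship `pandas.json_normalize` (older versions had it as `pandas.io.json.json_normalize`), which can explode the participant list into one row per participant. A sketch on a hand-built dict shaped like one parsed event from the feed (values invented for illustration):

```python
import pandas as pd

# A hand-built dict shaped like one parsed event from the feed above
events = [
    {
        'gamenumber': '426688683',
        'league': 'Argentinian',
        'participants': {
            'participant': [
                {'participant_name': 'Obras Sanitarias',
                 'visiting_home_draw': 'Home'},
                {'participant_name': 'Libertad',
                 'visiting_home_draw': 'Visiting'},
            ]
        },
    }
]

# One output row per participant; event-level fields are repeated via meta.
flat = pd.json_normalize(
    events,
    record_path=['participants', 'participant'],
    meta=['gamenumber', 'league'],
)
```

This turns the single event into two fully flat rows, one per participant, which is usually easier to work with than a dict-valued column.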