如何将嵌套的 JSON 下载到 pandas 数据框中

How to download a nested JSON into a pandas dataframe

希望提高我的数据科学技能。我正在练习从体育网站提取 url 数据,并且 json 文件有多个嵌套词典。我希望能够提取这些数据以在 matplotlib 等中映射我自己的排行榜自定义形式,但是我很难将 json 转换为可行的 df。

主要网站是:https://www.usopen.com/scoring.html

查看背景,我相信实时信息是从下面的短代码中列出的 link 中提取的。我在 Jupyter 笔记本上工作。我可以成功拉取数据。

但如您所见,它正在拉取多个嵌套字典,这使得拉取简单数据帧变得非常困难。

只是想找到球员,得分达到标准杆,总分,以及轮次。任何帮助将不胜感激,谢谢!

import pandas as pd
import urllib as ul
import json
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
response = ul.request.urlopen(url)
data = json.loads(response.read())
print(data)

您可能想试试这个:

import requests
import pandas as pd


url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
data = pd.DataFrame.from_dict(requests.get(url).json()['standings'])

print(data['totalScore'])

输出:

0      {'value': 140, 'format': 'absolute', 'displayV...
1      {'value': 136, 'format': 'absolute', 'displayV...
2      {'value': 140, 'format': 'absolute', 'displayV...
3      {'value': 138, 'format': 'absolute', 'displayV...
4      {'value': 138, 'format': 'absolute', 'displayV...
                             ...                        

您确实需要编写一些自定义代码才能从 json 中获得您想要的内容。但是,如果您想将一些播放器详细信息放入 df 中,这里有一些灵感。

df = pd.DataFrame([x['player'] for x in data['standings']])
df['image'] = df['image'].apply(lambda x: x['identifier'])
df['country'] = df['country'].apply(lambda x: x['name'])
  • 使用requests.get(url).json()获取数据
  • 使用 pandas.json_normalizestandings 密钥解压到数据帧中
  • roundScores 是一个字典列表
    • 列表必须用 .explode
    • 展开
    • dicts 的列必须再次规范化
  • 将规范化的列连接回数据框 df
import requests
import pandas as pd

# load the data
df = pd.json_normalize(requests.get(url).json(), 'standings')

# explode the roundScores column
df = df.explode('roundScores').reset_index(drop=True)

# normalize the dicts in roundScores and join back to df
df = df.join(pd.json_normalize(df.roundScores), rsuffix='_rs').drop(columns=['roundScores']).reset_index(drop=True)

# display(df.head())
   isRecapAvailable player.identifier player.firstName player.lastName player.image.gravity player.image.type                     player.image.identifier player.image.cropMode player.country.name player.country.code player.country.flag.type player.country.flag.identifier  player.isAmateur  toPar.value toPar.format toPar.displayValue  toParToday.value toParToday.format toParToday.displayValue  totalScore.value totalScore.format totalScore.displayValue  position.value position.format position.displayValue  holesThrough.value holesThrough.format holesThrough.displayValue liveVideo.identifier liveVideo.isLive  score.value score.format score.displayValue  toPar.value_rs toPar.format_rs toPar.displayValue_rs
0              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
1              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           74     absolute                 74               4        absolute                    +4
2              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN            0     absolute                                 -5        absolute                    -5
3              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
4              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           70     absolute                 70               0        absolute                     E

附加键

  • standings 只是下载的密钥之一 JSON
r = requests.get(url).json()

print(r)
[out]:
dict_keys(['currentRound', 'standings', 'fullLegend', 'shortLegend', 'inlineLegend', 'cutLine', 'meta'])

资源

  • Split / Explode a column of dictionaries into separate columns with pandas

简单快速的解决方案。 JSON normalize from pandas 可能存在更好的解决方案,但这对您的用例来说相当不错。

def func(x):
    if not any(x.isnull()):
        return (x['round'], x['player']['firstName'], x['player']['identifier'], x['toParToday']['value'], x['totalScore']['value'])

df = pd.DataFrame(data['standings'])
df['round'] = data['currentRound']['name']
df = df[['player', 'toPar', 'toParToday', 'totalScore', 'round']]
info = df.apply(func, axis=1)
info_df = pd.DataFrame(list(info.values), columns=['Round', 'player_name', 'pid', 'to_par_today', 'totalScore'])
info_df.head()