How to download a nested JSON into a pandas dataframe
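I'm looking to improve my data science skills. I'm practicing pulling data from a URL on a sports website, and the JSON file contains several levels of nested dictionaries. I'd like to extract this data so I can plot my own custom leaderboard in matplotlib or similar, but I'm having a hard time turning the JSON into a workable df.
The main website is: https://www.usopen.com/scoring.html
Looking behind the scenes, I believe the live information is pulled from the link listed in the short code below. I'm working in a Jupyter notebook, and I can pull the data successfully.
But as you can see, it comes back as multiple nested dictionaries, which makes it very difficult to build a simple dataframe.
I'm just looking to get the player, score to par, total score, and round. Any help would be greatly appreciated, thanks!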
import pandas as pd
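import urllib.request as ul  # import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded
import json
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
response = ul.urlopen(url)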
data = json.loads(response.read())
print(data)
You might want to try this:
import requests
import pandas as pd
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
data = pd.DataFrame.from_dict(requests.get(url).json()['standings'])
print(data['totalScore'])
Output:
0 {'value': 140, 'format': 'absolute', 'displayV...
1 {'value': 136, 'format': 'absolute', 'displayV...
2 {'value': 140, 'format': 'absolute', 'displayV...
3 {'value': 138, 'format': 'absolute', 'displayV...
4 {'value': 138, 'format': 'absolute', 'displayV...
...
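You do need to write some custom code to get what you want out of the JSON. But if you want to put some of the player details into a df, here is some inspiration. Note that data here is the parsed JSON dict from the question, not a dataframe.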
df = pd.DataFrame([x['player'] for x in data['standings']])
df['image'] = df['image'].apply(lambda x: x['identifier'])
df['country'] = df['country'].apply(lambda x: x['name'])
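The same pattern can be extended to the scoring columns the question asks about. Below is a minimal sketch that runs offline: the `data` dict is an invented sample that only imitates the shape of the `standings` entries (the names and numbers are made up), and `summary` is a hypothetical name for the result.

```python
import pandas as pd

# Invented sample that mimics the nesting of the leaderboard JSON;
# it is not real tournament data.
data = {
    "standings": [
        {
            "player": {"firstName": "Ann", "lastName": "Alpha"},
            "toPar": {"value": -4, "format": "absolute", "displayValue": "-4"},
            "totalScore": {"value": 136, "format": "absolute", "displayValue": "136"},
        },
        {
            "player": {"firstName": "Bob", "lastName": "Bravo"},
            "toPar": {"value": 1, "format": "absolute", "displayValue": "+1"},
            "totalScore": {"value": 141, "format": "absolute", "displayValue": "141"},
        },
    ]
}

# Each column of this frame holds one dict per row.
df = pd.DataFrame(data["standings"])

# Pull a single scalar out of every dict to get flat columns.
summary = pd.DataFrame({
    "player": df["player"].apply(lambda p: f"{p['firstName']} {p['lastName']}"),
    "to_par": df["toPar"].apply(lambda d: d["value"]),
    "total_score": df["totalScore"].apply(lambda d: d["value"]),
})
print(summary)
```

With the live feed, some entries may lack a key or carry a null instead of a dict, so a guarded lookup such as `d.get("value") if isinstance(d, dict) else None` is likely safer than plain indexing.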
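- Use requests.get(url).json() to fetch the data
- Use pandas.json_normalize to unpack the standings key into a dataframe
- roundScores is a list of dicts
  - The list must be expanded with .explode
  - The resulting column of dicts must then be normalized again
- Join the normalized columns back to the dataframe df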
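import requests
import pandas as pd
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
# load the data; each entry of 'standings' becomes one row
df = pd.json_normalize(requests.get(url).json(), 'standings')
# explode the roundScores column so each round gets its own row
df = df.explode('roundScores').reset_index(drop=True)
# normalize the dicts in roundScores and join back to df;
# rsuffix='_rs' marks per-round columns that clash with existing names (e.g. toPar.value)
df = df.join(pd.json_normalize(df.roundScores), rsuffix='_rs').drop(columns=['roundScores']).reset_index(drop=True)
# display(df.head())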
isRecapAvailable player.identifier player.firstName player.lastName player.image.gravity player.image.type player.image.identifier player.image.cropMode player.country.name player.country.code player.country.flag.type player.country.flag.identifier player.isAmateur toPar.value toPar.format toPar.displayValue toParToday.value toParToday.format toParToday.displayValue totalScore.value totalScore.format totalScore.displayValue position.value position.format position.displayValue holesThrough.value holesThrough.format holesThrough.displayValue liveVideo.identifier liveVideo.isLive score.value score.format score.displayValue toPar.value_rs toPar.format_rs toPar.displayValue_rs
0 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 66 absolute 66 -4 absolute -4
1 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 74 absolute 74 4 absolute +4
2 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 0 absolute -5 absolute -5
3 True 34360 Patrick Reed center imageCloudinary us-open/players/2019-players/Patrick-Reed fill United States usa imageCloudinary us-open/flags/usa False -4 absolute -4 0 absolute E 136.0 absolute 136 2 absolute 2 7 absolute 7 NaN NaN 66 absolute 66 -4 absolute -4
4 True 34360 Patrick Reed center imageCloudinary us-open/players/2019-players/Patrick-Reed fill United States usa imageCloudinary us-open/flags/usa False -4 absolute -4 0 absolute E 136.0 absolute 136 2 absolute 2 7 absolute 7 NaN NaN 70 absolute 70 0 absolute E
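As an alternative to the explode-and-join steps above, `pandas.json_normalize` can also produce one row per round directly through its `record_path` and `meta` parameters. The following is a sketch under assumptions: `payload` is an invented sample that imitates the shape of the feed, and `rounds` and the `round.number` column are hypothetical names. It also assumes `roundScores` is listed in playing order.

```python
import pandas as pd

# Invented payload that only imitates the structure of the leaderboard JSON.
payload = {
    "standings": [
        {
            "player": {"firstName": "Ann", "lastName": "Alpha"},
            "toPar": {"value": -4, "displayValue": "-4"},
            "totalScore": {"value": 136, "displayValue": "136"},
            "roundScores": [
                {"score": {"value": 66}, "toPar": {"value": -4}},
                {"score": {"value": 70}, "toPar": {"value": 0}},
            ],
        },
        {
            "player": {"firstName": "Bob", "lastName": "Bravo"},
            "toPar": {"value": 1, "displayValue": "+1"},
            "totalScore": {"value": 141, "displayValue": "141"},
            "roundScores": [
                {"score": {"value": 69}, "toPar": {"value": -1}},
                {"score": {"value": 72}, "toPar": {"value": 2}},
            ],
        },
    ]
}

rounds = pd.json_normalize(
    payload["standings"],
    record_path="roundScores",      # one output row per round
    meta=[                          # player-level fields repeated on each row
        ["player", "firstName"],
        ["player", "lastName"],
        ["toPar", "value"],
        ["totalScore", "value"],
    ],
    record_prefix="round.",         # avoids a name clash on 'toPar.value'
    errors="ignore",                # missing meta keys become NaN instead of raising
)

# Number the rounds per player, assuming the list order is the playing order.
rounds["round.number"] = rounds.groupby("player.lastName").cumcount() + 1
print(rounds)
```

The `record_prefix` is needed here because both the player entry and each round carry a `toPar` key; without it `json_normalize` raises a conflict error rather than silently overwriting one of them.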
Additional keys
standings is just one of the keys in the downloaded JSON
r = requests.get(url).json()
print(r.keys())
[out]:
dict_keys(['currentRound', 'standings', 'fullLegend', 'shortLegend', 'inlineLegend', 'cutLine', 'meta'])
Resources
- Split / Explode a column of dictionaries into separate columns with pandas
A quick and simple solution. There may be a better one using pandas' JSON normalize, but this works fairly well for your use case. Here data is the parsed JSON dict from the question.
def func(x):
    # skip rows with any missing field; such rows return None
    if not any(x.isnull()):
        return (x['round'], x['player']['firstName'], x['player']['identifier'], x['toParToday']['value'], x['totalScore']['value'])
df = pd.DataFrame(data['standings'])
df['round'] = data['currentRound']['name']
df = df[['player', 'toPar', 'toParToday', 'totalScore', 'round']]
# drop the None results so every remaining entry is a 5-tuple
info = df.apply(func, axis=1).dropna()
info_df = pd.DataFrame(list(info.values), columns=['Round', 'player_name', 'pid', 'to_par_today', 'totalScore'])
info_df.head()