How to normalize a JSON file containing a list (that should be kept as a list) in Python | Pandas?
I am trying to convert a JSON file into a DataFrame using the json_normalize function.
Source JSON
The JSON is a list of dictionaries, each of which looks like this:
{
  "sport_key": "basketball_ncaab",
  "sport_nice": "NCAAB",
  "teams": [
    "Bryant Bulldogs",
    "Wagner Seahawks"
  ],
  "commence_time": 1608152400,
  "home_team": "Bryant Bulldogs",
  "sites": [
    {
      "site_key": "marathonbet",
      "site_nice": "Marathon Bet",
      "last_update": 1608156452,
      "odds": {
        "h2h": [
          1.28,
          3.54
        ]
      }
    },
    {
      "site_key": "sport888",
      "site_nice": "888sport",
      "last_update": 1608156452,
      "odds": {
        "h2h": [
          1.13,
          5.8
        ]
      }
    },
    {
      "site_key": "unibet",
      "site_nice": "Unibet",
      "last_update": 1608156434,
      "odds": {
        "h2h": [
          1.13,
          5.8
        ]
      }
    }
  ],
  "sites_count": 3
}
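The problem is that one of the intended columns contains a list (and it should stay a list), but including that column in the meta argument of json_normalize raises the following error:
ValueError: operands could not be broadcast together with shape (22,) (11,)
The error appears when I add 'teams' to the meta list in the following code: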
pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])
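A likely explanation for the error, sketched with NumPy only. This assumes that the pandas version in question collects the meta values for each key into an object array and repeats each entry once per `sites` row; that is an assumption about pandas internals, and other pandas versions may behave differently. The second team pair and the `lengths` values below are made up for illustration. A list of equal-length lists becomes a 2-D array, so the flattened element count no longer matches the number of records, which would fit the (22,) vs (11,) in the message if there were 11 records with two teams each.

```python
import numpy as np

# Hypothetical stand-in for the collected "teams" meta values:
# one two-element list per top-level record.
teams_meta = [
    ["Bryant Bulldogs", "Wagner Seahawks"],
    ["Team A", "Team B"],  # made-up second record
]
# Hypothetical number of `sites` rows contributed by each record.
lengths = [3, 1]

# Equal-length lists are stacked into a 2-D array instead of a
# 1-D array of list objects.
arr = np.array(teams_meta, dtype=object)
print(arr.shape)  # (2, 2)

# Without an axis, repeat() works on the flattened array (4 elements),
# which cannot be matched against 2 repeat counts.
try:
    arr.repeat(lengths)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```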
Assuming data is a list of dictionaries, you can still use json_normalize, but you have to assign the teams column separately for each dictionary in data:
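import pandas as pd

def normalize(d):
    # Flatten one game's sites, then attach the same teams list to every resulting row.
    return pd.json_normalize(d, 'sites', ['sport_key', 'sport_nice', 'home_team'])\
        .assign(teams=[d['teams']] * len(d['sites']))

df = pd.concat([normalize(d) for d in data], ignore_index=True)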
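For anyone who wants to try the per-dictionary approach without the original file, here is a self-contained sketch. The sample `data` is a trimmed-down, partly made-up version of the JSON from the question ("Team A" and "Team B" and their odds are hypothetical), with one game that has two sites and one that has a single site.

```python
import pandas as pd

# Hypothetical sample shaped like the JSON in the question (fields trimmed).
data = [
    {
        "sport_key": "basketball_ncaab",
        "sport_nice": "NCAAB",
        "teams": ["Bryant Bulldogs", "Wagner Seahawks"],
        "home_team": "Bryant Bulldogs",
        "sites": [
            {"site_key": "marathonbet", "odds": {"h2h": [1.28, 3.54]}},
            {"site_key": "sport888", "odds": {"h2h": [1.13, 5.8]}},
        ],
    },
    {
        "sport_key": "basketball_ncaab",
        "sport_nice": "NCAAB",
        "teams": ["Team A", "Team B"],  # made-up game
        "home_team": "Team A",
        "sites": [
            {"site_key": "unibet", "odds": {"h2h": [1.5, 2.5]}},
        ],
    },
]


def normalize(d):
    # Only scalar fields go into meta; the list is added afterwards.
    flat = pd.json_normalize(d, "sites", ["sport_key", "sport_nice", "home_team"])
    # One copy of the teams list per site row of this game.
    return flat.assign(teams=[d["teams"]] * len(d["sites"]))


df = pd.concat([normalize(d) for d in data], ignore_index=True)
print(df[["site_key", "home_team", "teams"]])
```

Note that `[d["teams"]] * n` repeats references to the same list object, so the rows of one game share a single list; use `[list(d["teams"]) for _ in d["sites"]]` if you plan to modify the lists per row.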
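Alternatively, you can join teams into a comma-separated string before normalizing and split it back into a list afterwards: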
data = [{**d, 'teams': ','.join(d['teams'])} for d in data]
df = pd.json_normalize(data, 'sites', ['sport_key', 'sport_nice', 'home_team', 'teams'])
df['teams'] = df['teams'].str.split(',')
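One caveat with the join/split trick is that it breaks if a team name itself contains a comma. A possible variant, sketched here rather than taken from the original answer, is to serialize the list with `json.dumps` before normalizing and restore it with `json.loads` afterwards. The sample `data` and the team name "Team A, Reserves" are hypothetical and only there to show the comma case.

```python
import json

import pandas as pd

# Hypothetical sample; one team name deliberately contains a comma.
data = [
    {
        "sport_key": "basketball_ncaab",
        "home_team": "Team A, Reserves",
        "teams": ["Team A, Reserves", "Team B"],
        "sites": [
            {"site_key": "marathonbet", "odds": {"h2h": [1.28, 3.54]}},
            {"site_key": "unibet", "odds": {"h2h": [1.13, 5.8]}},
        ],
    },
]

# Turn each teams list into a JSON string so it is a scalar meta value.
encoded = [{**d, "teams": json.dumps(d["teams"])} for d in data]

df = pd.json_normalize(encoded, "sites", ["sport_key", "home_team", "teams"])

# Decode the JSON strings back into Python lists.
df["teams"] = df["teams"].map(json.loads)
print(df[["site_key", "teams"]])
```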
Result:
      site_key     site_nice  last_update      odds.h2h         sport_key  sport_nice        home_team                               teams
0  marathonbet  Marathon Bet   1608156452  [1.28, 3.54]  basketball_ncaab       NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
1     sport888      888sport   1608156452   [1.13, 5.8]  basketball_ncaab       NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]
2       unibet        Unibet   1608156434   [1.13, 5.8]  basketball_ncaab       NCAAB  Bryant Bulldogs  [Bryant Bulldogs, Wagner Seahawks]