Flatten JSON columns in a dataframe with lists
I have JSON in a dataframe column that looks like this:
x = '''{"sections":
[{
"id": "12ab",
"items": [
{"id": "34cd",
"isValid": true,
"questionaire": {"title": "blah blah", "question": "Date of Purchase"}
},
{"id": "56ef",
"isValid": true,
"questionaire": {"title": "something useless", "question": "Date of Billing"}
}
]
}],
"ignore": "yes"}'''
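I want the id, the inner id from the items list, and the question from the questionaire JSON.
I was able to extract this information with the following code: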
import json
import pandas as pd

# Normalize the top-level sections; each row keeps its list of items
df_norm = pd.json_normalize(json.loads(x)['sections'])
df_norm = df_norm[['id', 'items']]
# Expand each items list into rows, keyed by the parent row's index label
df1 = (pd.concat({k: pd.DataFrame(v) for k, v in df_norm.pop('items').items()})
         .reset_index(level=1, drop=True))
df = df_norm.join(df1, rsuffix='_').reset_index(drop=True)
df['child_id'] = df.pop('id_')
df = df[['id', 'child_id', 'questionaire']]
df.questionaire = df.questionaire.fillna({i: {} for i in df.index})
idx = df.set_index(['id', 'child_id']).questionaire.index
result = pd.DataFrame(df.set_index(['id', 'child_id'])
                        .questionaire.values.tolist(), index=idx).reset_index()
result = result[['id', 'child_id', 'question']]
result
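The resulting dataframe looks like this. You can run the code to verify:

| | id | child_id | question |
|---|---|---|---|
| 0 | 12ab | 34cd | Date of Purchase |
| 1 | 12ab | 56ef | Date of Billing |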
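My problem is getting this to work on a DataFrame in which the JSON shown above is itself the value of a column. The input I actually have looks like this:

| id | name | location | flatten |
|---|---|---|---|
| 1 | xyz | new york | the JSON `x` above |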
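I can't work out how to combine the results when the column holds multiple JSON values.
The final DataFrame I want is:

| Masterid | name | location | id | child_id | question |
|---|---|---|---|---|---|
| 1 | xyz | new york | 12ab | 34cd | Date of Purchase |
| 1 | xyz | new york | 12ab | 56ef | Date of Billing |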
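The idea is to use a dictionary comprehension over the `flatten` column, with `i` as the index values, so that after `concat` the result can be joined back to the original DataFrame: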
import json
import pandas as pd

x = '''{"sections":
[{
"id": "12ab",
"items": [
{"id": "34cd",
"isValid": true,
"questionaire": {"title": "blah blah", "question": "Date of Purchase"}
},
{"id": "56ef",
"isValid": true,
"questionaire": {"title": "something useless", "question": "Date of Billing"}
}
]
}],
"ignore": "yes"}'''
df = pd.DataFrame({'id': ['1', '2'], 'name': ['xyz', 'abc'],
                   'location': ['new york', 'wien'], 'flatten': [x, x]})

# create default RangeIndex
df = df.reset_index(drop=True)

# one flattened DataFrame per row, keyed by the row's index label i
d = {i: pd.json_normalize(json.loads(x)['sections'],
                          'items', ['id'],
                          record_prefix='child_')[['id', 'child_id', 'child_questionaire.question']]
        .rename(columns={'child_questionaire.question': 'question'})
     for i, x in df.pop('flatten').items()}

# concat builds an (i, position) MultiIndex; dropping the inner level leaves i for the join
df_norm = df.rename(columns={'id': 'Masterid'}).join(pd.concat(d).reset_index(level=1, drop=True))
print(df_norm)
  Masterid name  location    id child_id          question
0        1  xyz  new york  12ab     34cd  Date of Purchase
0        1  xyz  new york  12ab     56ef   Date of Billing
1        2  abc      wien  12ab     34cd  Date of Purchase
1        2  abc      wien  12ab     56ef   Date of Billing
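
To see why the index alignment works, here is a minimal sketch on toy data. The frames and the names `parents`, `children`, `stacked`, and `aligned` are made up for illustration and are not part of the original question. Passing a dict to `pd.concat` uses its keys as the outer index level, so dropping the inner level leaves only the parent row label, and an index-on-index `join` then repeats each parent row once per child:

```python
import pandas as pd

# Parent frame with the default RangeIndex (labels 0 and 1).
parents = pd.DataFrame({"name": ["xyz", "abc"]})

# Hypothetical toy data: one small child frame per parent row label.
children = {
    0: pd.DataFrame({"child_id": ["34cd", "56ef"]}),
    1: pd.DataFrame({"child_id": ["78gh"]}),
}

# The dict keys become the outer level of a two-level MultiIndex:
# (parent label, position within the child frame).
stacked = pd.concat(children)

# Drop the inner level so only the parent label remains as the index.
aligned = stacked.reset_index(level=1, drop=True)

# Join on the index: each parent row is repeated once per matching child row.
result = parents.join(aligned)
print(result)
```

Because the join is on index labels, the `reset_index(drop=True)` step in the answer matters whenever the original DataFrame does not already have a unique index.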