Using pandas and json_normalize to flatten a nested JSON API response
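I have a deeply nested JSON that I am trying to convert to a Pandas DataFrame using json_normalize.
A generic sample of the JSON data I'm working with looks like this (additional context is at the end of this post):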
{
"per_page": 2,
"total": 1,
"data": [{
"total_time": 0,
"collection_mode": "default",
"href": "https://api.surveymonkey.com/v3/responses/5007154325",
"custom_variables": {
"custvar_1": "one",
"custvar_2": "two"
},
"custom_value": "custom identifier for the response",
"edit_url": "https://www.surveymonkey.com/r/",
"analyze_url": "https://www.surveymonkey.com/analyze/browse/",
"ip_address": "",
"pages": [
{
"id": "103332310",
"questions": [{
"answers": [{
"choice_id": "3057839051"
}
],
"id": "319352786"
}
]
},
{
"id": "44783164",
"questions": [{
"id": "153745381",
"answers": [{
"text": "some_name"
}
]
}
]
},
{
"id": "44783183",
"questions": [{
"id": "153745436",
"answers": [{
"col_id": "1087201352",
"choice_id": "1087201369",
"row_id": "1087201362"
}, {
"col_id": "1087201353",
"choice_id": "1087201373",
"row_id": "1087201362"
}
]
}
]
}
],
"date_modified": "1970-01-17T19:07:34+00:00",
"response_status": "completed",
"id": "5007154325",
"collector_id": "50253586",
"recipient_id": "0",
"date_created": "1970-01-17T19:07:34+00:00",
"survey_id": "105723396"
}
],
"page": 1,
"links": {
"self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
}
}
I would like to end up with a dataframe that contains the question_id, page_id, response_id, and the response data, like this:
    choice_id      col_id      row_id       text question_id    page_id response_id
0  3057839051         NaN         NaN        NaN   319352786  103332310  5007154325
1         NaN         NaN         NaN  some_name   153745381   44783164  5007154325
2  1087201369  1087201352  1087201362        NaN   153745436   44783183  5007154325
3  1087201373  1087201353  1087201362        NaN   153745436   44783183  5007154325
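I can get close by running the following code (Python 3.6):
# so_survey_responses is the parsed JSON shown above (a dict)
from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)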
Which returns:
                                     question_answers  question_id          id
0                      [{'choice_id': '3057839051'}]    319352786  5007154325
1                            [{'text': 'some_name'}]    153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...    153745436  5007154325
But if I run json_normalize at a deeper nesting level and try to keep the 'question_id' data from the result above, I only get the page_id values back, not the correct question_id values:
answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)
Returns:
    choice_id      col_id      row_id       text          id questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325    103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325     44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325     44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325     44783183   44783183
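A complicating factor may be that all of the above (question_id, page_id, response_id) are stored under the key 'id' in the JSON data.
I'm sure this is possible, but I can't get there. Are there any examples of how to do this?
Additional context:
I'm trying to create a dataframe of the SurveyMonkey API response output.
My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.
I plan to do this by setting up the response dataframe (as above), and then using .apply() to match responses with their survey structure API output.
I'm finding the SurveyMonkey API rather tedious when it comes to providing useful output, but I'm new to Pandas, so it may well be on me.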
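There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.
However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. The package's page has example usage showing how to flatten a deeply nested JSON and convert it to a Pandas dataframe.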
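To make the flattening idea concrete, here is a minimal plain-Python sketch. The helper name flatten_record, the underscore separator, and the use of list positions as key parts are all assumptions made for illustration; this is not the API of the flatten package mentioned above.

```python
def flatten_record(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts/lists into a single-level dict.

    Hypothetical helper for illustration only.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        # Use the list position as the key part, e.g. pages_0_id.
        items = enumerate(obj)
    else:
        # Leaf value: store it under the accumulated key.
        return {parent_key: obj}

    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten_record(value, new_key, sep))
    return flat


# Trimmed-down stand-in for one element of the 'data' list in the question.
response = {
    "id": "5007154325",
    "pages": [
        {"id": "103332310",
         "questions": [{"id": "319352786",
                        "answers": [{"choice_id": "3057839051"}]}]},
    ],
}

flat = flatten_record(response)
for key, value in flat.items():
    print(key, "=", value)
```

Each flattened dict would become one wide row if passed to pd.DataFrame (one row per response, with keys such as pages_0_questions_0_id as columns). That is a different shape from the one-row-per-answer frame asked for in the question, so this route probably needs extra reshaping afterwards.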
You need to modify the meta parameter of your last attempt, and, if you want to rename the columns to be exactly the way you want, you can do that with rename:
answers_df = json_normalize(data=so_survey_responses['data'],
                            record_path=['pages', 'questions', 'answers'],
                            meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
    .rename(index=str,
            columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})