Using pandas and json_normalize to flatten a nested JSON API response

I have a deeply nested JSON that I am trying to turn into a Pandas DataFrame using json_normalize.

A generic sample of the JSON data I'm working with looks like this (I've added context at the bottom of the post):

{
    "per_page": 2,
    "total": 1,
    "data": [{
            "total_time": 0,
            "collection_mode": "default",
            "href": "https://api.surveymonkey.com/v3/responses/5007154325",
            "custom_variables": {
                "custvar_1": "one",
                "custvar_2": "two"
            },
            "custom_value": "custom identifier for the response",
            "edit_url": "https://www.surveymonkey.com/r/",
            "analyze_url": "https://www.surveymonkey.com/analyze/browse/",
            "ip_address": "",
            "pages": [
                {
                    "id": "103332310",
                    "questions": [{
                            "answers": [{
                                    "choice_id": "3057839051"
                                }
                            ],
                            "id": "319352786"
                        }
                    ]
                },
                {
                    "id": "44783164",
                    "questions": [{
                            "id": "153745381",
                            "answers": [{
                                    "text": "some_name"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "44783183",
                    "questions": [{
                            "id": "153745436",
                            "answers": [{
                                    "col_id": "1087201352",
                                    "choice_id": "1087201369",
                                    "row_id": "1087201362"
                                }, {
                                    "col_id": "1087201353",
                                    "choice_id": "1087201373",
                                    "row_id": "1087201362"
                                }
                                ]
                            }
                        ]
                }
            ],
            "date_modified": "1970-01-17T19:07:34+00:00",
            "response_status": "completed",
            "id": "5007154325",
            "collector_id": "50253586",
            "recipient_id": "0",
            "date_created": "1970-01-17T19:07:34+00:00",
            "survey_id": "105723396"
        }
    ],
    "page": 1,
    "links": {
        "self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
    }
}

I would like to end up with a dataframe that includes the question_id, page_id, response_id, and the response data, like this:

    choice_id      col_id      row_id       text   question_id       page_id      response_id
0  3057839051         NaN         NaN        NaN     319352786     103332310       5007154325
1         NaN         NaN         NaN  some_name     153745381      44783164       5007154325
2  1087201369  1087201352  1087201362        NaN     153745436      44783183       5007154325
3  1087201373  1087201353  1087201362        NaN     153745436      44783183       5007154325

I can get close by running the following code (Python 3.6):

from pandas.io.json import json_normalize  # on pandas 1.0+, use: from pandas import json_normalize

df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)

Which returns:

                                    question_answers question_id          id
0                      [{'choice_id': '3057839051'}]   319352786  5007154325
1                            [{'text': 'some_name'}]   153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...   153745436  5007154325

But if I try to run json_normalize one level deeper and keep the 'question_id' data from the result above, the page_id values are returned in place of the correct question_id values:

answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)

Returns:

    choice_id      col_id      row_id       text          id questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325    103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325     44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325     44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325     44783183   44783183

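A complicating factor may be that all of the above (question_id, page_id, response_id) are stored under the key 'id' in the JSON data.

I'm sure this is possible, but I can't get there. Are there any examples of how to do this?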

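Additional context: I am trying to create a dataframe of SurveyMonkey API response output.

My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.

I plan to do this by setting up the response dataframe (as above), and then using .apply() to match responses with their survey structure API output.

I'm finding the SurveyMonkey API pretty obtuse at providing useful output, but I am new to Pandas, so it is probably on me.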

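There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.

However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. The package's page has example usage showing how to flatten a deeply nested JSON and convert it to a Pandas dataframe.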
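If adding a dependency is not an option, the same rows can also be built with plain loops. The following is a minimal sketch and not the flatten package's API: `response_rows` is a hypothetical helper name, and `sample` is a trimmed copy of the JSON from the question.

```python
# Trimmed copy of the sample response from the question (assumption:
# only the keys needed for the target table are kept).
sample = {
    "data": [{
        "id": "5007154325",
        "pages": [
            {"id": "103332310",
             "questions": [{"id": "319352786",
                            "answers": [{"choice_id": "3057839051"}]}]},
            {"id": "44783164",
             "questions": [{"id": "153745381",
                            "answers": [{"text": "some_name"}]}]},
            {"id": "44783183",
             "questions": [{"id": "153745436",
                            "answers": [
                                {"col_id": "1087201352",
                                 "choice_id": "1087201369",
                                 "row_id": "1087201362"},
                                {"col_id": "1087201353",
                                 "choice_id": "1087201373",
                                 "row_id": "1087201362"},
                            ]}]},
        ],
    }]
}


def response_rows(payload):
    """Yield one flat dict per answer, tagged with its parent ids."""
    for response in payload.get("data", []):
        for page in response.get("pages", []):
            for question in page.get("questions", []):
                for answer in question.get("answers", []):
                    row = dict(answer)  # copy so the input is not mutated
                    row["question_id"] = question["id"]
                    row["page_id"] = page["id"]
                    row["response_id"] = response["id"]
                    yield row


rows = list(response_rows(sample))
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` should then give one row per answer, with NaN wherever an answer lacks a given key. Because each id is read from the loop variable it belongs to, the three 'id' keys cannot be confused with one another.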

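You need to modify the meta argument of your last attempt so that each nested path is spelled out in full from the top of the record. If you also want the columns named exactly as in your target output, you can use rename: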

answers_df = json_normalize(data=so_survey_responses['data'],
                            record_path=['pages', 'questions', 'answers'],
                            meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
    .rename(index=str,
            columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})
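
As a quick check of this call, here is a self-contained sketch run against a trimmed copy of the sample from the question. The `sample` variable and the import fallback are assumptions added for illustration: `pd.json_normalize` is the top-level name in pandas 1.0 and later, while older releases expose the function as `pandas.io.json.json_normalize`.

```python
import pandas as pd

try:
    json_normalize = pd.json_normalize  # pandas >= 1.0
except AttributeError:
    from pandas.io.json import json_normalize  # older pandas releases

# Trimmed copy of the sample response from the question.
sample = {
    "data": [{
        "id": "5007154325",
        "pages": [
            {"id": "103332310",
             "questions": [{"id": "319352786",
                            "answers": [{"choice_id": "3057839051"}]}]},
            {"id": "44783164",
             "questions": [{"id": "153745381",
                            "answers": [{"text": "some_name"}]}]},
            {"id": "44783183",
             "questions": [{"id": "153745436",
                            "answers": [
                                {"col_id": "1087201352",
                                 "choice_id": "1087201369",
                                 "row_id": "1087201362"},
                                {"col_id": "1087201353",
                                 "choice_id": "1087201373",
                                 "row_id": "1087201362"},
                            ]}]},
        ],
    }]
}

answers_df = json_normalize(
    data=sample["data"],
    record_path=["pages", "questions", "answers"],
    # Each nested meta path starts at the top of a response record.
    meta=["id", ["pages", "questions", "id"], ["pages", "id"]],
).rename(columns={"id": "response_id",
                  "pages.questions.id": "question_id",
                  "pages.id": "page_id"})

print(answers_df[["question_id", "page_id", "response_id"]])
```

The only change from the attempt in the question is the meta path for the question id: `['pages', 'questions', 'id']` instead of `['questions', 'id']`. Judging by the output in the question, the shorter two-element path was read at the page level, which is why it returned page ids.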