json_normalize produces a KeyError when trying to extract certain attributes

Question

This is a subset of my JSON file:

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

I want to load this into a DataFrame with one row for each question and answer combination.

Python code:

from pandas import json_normalize
import json

fields = ['text','answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]

print(data)

This produces a KeyError:

KeyError: "['answers.text'] not in index"

I have been stuck on this for hours and cannot figure it out. I feel like it should be simple, but it never is.

Answer 1

This is a technique I use often:

1. json_normalize() the top-level list
2. explode() the child list, then reset_index() so that step 3 works
3. apply(pd.Series) to the resulting column of dicts

d = {'data': {'questions': [{'id': 6574,
    'text': 'Question #1',
    'instructionalText': '',
    'minimumResponses': 0,
    'maximumResponses': None,
    'sortOrder': 1,
    'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
     {'id': 362950, 'text': 'Answer #2', 'parentId': None},
     {'id': 362951, 'text': 'Answer #3', 'parentId': None},
     {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

import pandas as pd

df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")

	id	text	sortOrder	id_ans	text_ans
0	6574	Question #1	1	362949	Answer #1
1	6574	Question #1	1	362950	Answer #2
2	6574	Question #1	1	362951	Answer #3
3	6574	Question #1	1	362952	Answer #4
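
The three steps can also be run one at a time to see what each contributes. The following is a minimal sketch on a shortened copy of the data; the intermediate names step1, step2, answer_cols, and result are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Shortened copy of the question's data (two answers, fewer question fields)
d = {"data": {"questions": [{"id": 6574,
                             "text": "Question #1",
                             "answers": [{"id": 362949, "text": "Answer #1", "parentId": None},
                                         {"id": 362950, "text": "Answer #2", "parentId": None}]}]}}

# Step 1: one row per question; "answers" is still a single column holding lists
step1 = pd.json_normalize(d["data"]["questions"])

# Step 2: one row per answer; reset_index gives the unique index that join relies on
step2 = step1.explode("answers").reset_index(drop=True)

# Step 3: expand each answer dict into columns and join them back by index;
# rsuffix renames the overlapping answer columns to id_ans and text_ans
answer_cols = step2["answers"].apply(pd.Series)
result = step2.join(answer_cols, rsuffix="_ans").drop(columns="answers")

print(result[["text", "text_ans"]])
```

Selecting result[["text", "text_ans"]] at the end gives the question text and answer text pair that the question was after.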

Answer 2

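Use record_prefix together with record_path and meta, so that d can be normalized in one step.
- pd.json_normalize fails when key names overlap between record_path and meta, and here 'id' and 'text' appear in both.
- ValueError: Conflicting metadata name id, need distinguishing prefix is raised when record_prefix is not used.
The KeyError occurs because 'answers.text' is not a key in d. Dotted column names are generated by .json_normalize(), and only for nested dicts; here 'answers' is a list, so it stays a single column and 'answers.text' never exists.
If any top-level keys are not needed in df, remove them from meta.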

import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path= ['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
   answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
0      362949    Answer #1             None  6574  Question #1                                      0             None         1
1      362950    Answer #2             None  6574  Question #1                                      0             None         1
2      362951    Answer #3             None  6574  Question #1                                      0             None         1
3      362952    Answer #4             None  6574  Question #1                                      0             None         1
4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1
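
Both failure modes described in this answer can be reproduced with a short sketch. It uses a shortened copy of the data, and the names flat and conflict are only illustrative.

```python
import pandas as pd

# Shortened copy of the question's data
d = {"data": {"questions": [{"id": 6574,
                             "text": "Question #1",
                             "answers": [{"id": 362949, "text": "Answer #1", "parentId": None},
                                         {"id": 362950, "text": "Answer #2", "parentId": None}]}]}}

# The question's call: "answers" stays one column of lists, so a column
# named "answers.text" is never created and selecting it raises KeyError.
flat = pd.json_normalize(d["data"], ["questions"], errors="ignore")
print(flat.columns.tolist())

# Without record_prefix, "id" and "text" exist in both the answer records
# and the question-level meta, which json_normalize rejects.
try:
    pd.json_normalize(d["data"]["questions"],
                      record_path=["answers"],
                      meta=["id", "text"])
    conflict = None
except ValueError as exc:
    conflict = str(exc)
print(conflict)
```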

Extended test data

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                            {'id': 4756,
                             'text': 'Question #2',
                             'instructionalText': 'No cheating, cheater',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}
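
As a sketch of the note about trimming meta, the call below keeps only the question text alongside the answer fields. It runs on a shortened version of the test data, and the name wanted is only illustrative.

```python
import pandas as pd

# Shortened version of the extended test data (two questions, fewer fields)
d = {"data": {"questions": [
    {"id": 6574, "text": "Question #1", "sortOrder": 1,
     "answers": [{"id": 362949, "text": "Answer #1", "parentId": None},
                 {"id": 362950, "text": "Answer #2", "parentId": None}]},
    {"id": 4756, "text": "Question #2", "sortOrder": 1,
     "answers": [{"id": 262949, "text": "Answer #1", "parentId": None}]}]}}

# Only "text" is listed in meta, so no other question-level column is carried over
df = pd.json_normalize(data=d["data"]["questions"],
                       record_path=["answers"],
                       meta=["text"],
                       record_prefix="answers_")

# Question text next to answer text, one row per answer
wanted = df[["text", "answers_text"]]
print(wanted)
```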

Regarding the other answer: using .apply(pd.Series) is not recommended because it is very slow.
- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
- 10 million rows take 53 minutes
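
If the explode approach is still preferred, one commonly used way to avoid .apply(pd.Series) is to build the answer columns with a single DataFrame constructor call. This is a minimal sketch on a shortened copy of the data, not a measured benchmark; the name answers and the "_ans" suffix are illustrative choices.

```python
import pandas as pd

# Shortened copy of the question's data
d = {"data": {"questions": [{"id": 6574,
                             "text": "Question #1",
                             "answers": [{"id": 362949, "text": "Answer #1", "parentId": None},
                                         {"id": 362950, "text": "Answer #2", "parentId": None}]}]}}

df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)

# Build all answer columns at once from the list of dicts,
# instead of creating one pd.Series per row
answers = pd.DataFrame(df["answers"].tolist(), index=df.index).add_suffix("_ans")
df = df.drop(columns="answers").join(answers)

print(df)
```

Note that add_suffix renames every answer column, so parentId becomes parentId_ans here, unlike the rsuffix variant, which only renames overlapping names.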
