json_normalize produces a KeyError when trying to extract certain attributes

This is a subset of my JSON file:

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

I want to put this into a dataframe with one row per question and answer pair.

Python code:

from pandas import json_normalize
import json

fields = ['text','answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]

print(data)

This produces a KeyError:

KeyError: "['answers.text'] not in index"

I've been stuck on this for hours and cannot figure it out. I feel like it should be simple, but it never is.

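This is a technique I commonly use:

  1. json_normalize() the top-level list
  2. explode() the child list, then reset_index() so the index is unique for step 3
  3. apply(pd.Series)
  4. expand the dict inside the child list into columns
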
import pandas as pd

d = {'data': {'questions': [{'id': 6574,
    'text': 'Question #1',
    'instructionalText': '',
    'minimumResponses': 0,
    'maximumResponses': None,
    'sortOrder': 1,
    'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
     {'id': 362950, 'text': 'Answer #2', 'parentId': None},
     {'id': 362951, 'text': 'Answer #3', 'parentId': None},
     {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")

     id         text  instructionalText  minimumResponses  maximumResponses  sortOrder  id_ans   text_ans  parentId
0  6574  Question #1                                    0                            1  362949  Answer #1
1  6574  Question #1                                    0                            1  362950  Answer #2
2  6574  Question #1                                    0                            1  362951  Answer #3
3  6574  Question #1                                    0                            1  362952  Answer #4

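  • Use record_prefix together with record_path and meta, so d can be normalized in one call.
    • pd.json_normalize has a problem when key names overlap between record_path and meta, and 'id' and 'text' appear in both.
    • ValueError: Conflicting metadata name id, need distinguishing prefix occurs when record_prefix is not used.
  • The KeyError occurs because 'answers.text' is not a key in d. Dotted column names are created by json_normalize() when it flattens nested dicts, but 'answers' is a list, so it stays a single column.
  • If any top-level keys are not wanted in df, remove them from meta.
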
import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path= ['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
   answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
0      362949    Answer #1             None  6574  Question #1                                      0             None         1
1      362950    Answer #2             None  6574  Question #1                                      0             None         1
2      362951    Answer #3             None  6574  Question #1                                      0             None         1
3      362952    Answer #4             None  6574  Question #1                                      0             None         1
4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1
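
To make the points above concrete, here is a minimal sketch on a shortened copy of the question's data (one question, two answers; the trimmed dict and the variable names attempt, conflict_message and wanted are only for illustration). It shows the columns the question's call actually produces, the ValueError raised without record_prefix, and how the two fields the question asked for could be selected afterwards.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only.
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# The question's call: 'answers' stays a single column holding a list,
# so there is no 'answers.text' column to select.
attempt = pd.json_normalize(d['data'], ['questions'], errors='ignore')
print(list(attempt.columns))

# Without record_prefix, the meta keys 'id' and 'text' collide with the record keys.
try:
    pd.json_normalize(d['data']['questions'], record_path=['answers'], meta=['id', 'text'])
except ValueError as err:
    conflict_message = str(err)
    print(conflict_message)

# With record_prefix the names no longer collide, and the wanted columns can be selected.
df = pd.json_normalize(d['data']['questions'], record_path=['answers'],
                       meta=['id', 'text'], record_prefix='answers_')
wanted = df[['text', 'answers_text']]
print(wanted)
```

Note that the column to select is 'answers_text' (from record_prefix), not 'answers.text'.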

Extended test data

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                            {'id': 4756,
                             'text': 'Question #2',
                             'instructionalText': 'No cheating, cheater',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}

  • Regarding the other answer, using .apply(pd.Series) is not recommended because it is very slow.
    • Timing analysis on SO: Splitting dictionary/list inside a Pandas Column into Separate Columns
    • 10 million rows take 53 minutes
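
If the explode() approach is still preferred, one commonly suggested way to avoid .apply(pd.Series) is to build the answer columns in a single DataFrame constructor call. The following is only a sketch on a shortened copy of the data; it assumes every question has at least one answer (an empty list would explode to NaN and break the constructor), and the '_ans' suffix is an arbitrary choice.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only.
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# One row per (question, answer) pair; each 'answers' cell is now a single dict.
df = pd.json_normalize(d['data']['questions']).explode('answers').reset_index(drop=True)

# Expand all the dicts at once instead of row by row with .apply(pd.Series).
answers = pd.DataFrame(df.pop('answers').tolist()).add_suffix('_ans')

# Both frames share the same 0..n-1 index, so a plain join lines the rows up.
df = df.join(answers)
print(df)
```

The reset_index(drop=True) matters here: explode() repeats the original index, and the join relies on both frames having the same unique index.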