How to json_normalize a column in pandas with empty lists, without losing records
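I am using pd.json_normalize to flatten the "sections" field in this data into rows. It works fine except for rows where "sections" is an empty list.
That ID is ignored entirely and is missing from the final flattened dataframe. I need at least one row per unique ID in the data (some IDs may have many rows, up to one row per unique ID, section_id, question_id, and answer_id, as I unnest more fields in the data):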
{'_id': '5f48f708fe22ca4d15fb3b55',
'created_at': '2020-08-28T12:22:32Z',
'sections': []}]
Sample data:
sample = [{'_id': '5f48bee4c54cf6b5e8048274',
'created_at': '2020-08-28T08:23:00Z',
'sections': [{'comment': '',
'type_fail': None,
'answers': [{'comment': 'stuff',
'feedback': [],
'value': 10.0,
'answer_type': 'default',
'question_id': '5e59599c68369c24069630fd',
'answer_id': '5e595a7c3fbb70448b6ff935'},
{'comment': 'stuff',
'feedback': [],
'value': 10.0,
'answer_type': 'default',
'question_id': '5e598939cedcaf5b865ef99a',
'answer_id': '5e598939cedcaf5b865ef998'}],
'score': 20.0,
'passed': True,
'_id': '5e59599c68369c24069630fe',
'custom_fields': []},
{'comment': '',
'type_fail': None,
'answers': [{'comment': '',
'feedback': [],
'value': None,
'answer_type': 'not_applicable',
'question_id': '5e59894f68369c2398eb68a8',
'answer_id': '5eaad4e5b513aed9a3c996a5'},
{'comment': '',
'feedback': [],
'value': None,
'answer_type': 'not_applicable',
'question_id': '5e598967cedcaf5b865efe3e',
'answer_id': '5eaad4ece3f1e0794372f8b2'},
{'comment': "stuff",
'feedback': [],
'value': 0.0,
'answer_type': 'default',
'question_id': '5e598976cedcaf5b865effd1',
'answer_id': '5e598976cedcaf5b865effd3'}],
'score': 0.0,
'passed': True,
'_id': '5e59894f68369c2398eb68a9',
'custom_fields': []}]},
{'_id': '5f48f708fe22ca4d15fb3b55',
'created_at': '2020-08-28T12:22:32Z',
'sections': []}]
Testing:
import pandas as pd
df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")
At this point I am missing a row for ID "5f48f708fe22ca4d15fb3b55", which I still need.
df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")
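Can I change this somehow so that I get at least one row per ID? I'm dealing with millions of records and don't want to realize later that some IDs are missing from my final data. The only solution I can think of is to normalize each dataframe and then join it back to the original dataframe.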
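The join-back idea described in the question could be sketched roughly as follows. This is only an illustration under assumptions: the `sample` below is a shortened, hypothetical stand-in for the real data, and the names `parents`, `sections`, and `merged` are made up for the example.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the question's data.
sample = [
    {"_id": "a1", "created_at": "2020-08-28T08:23:00Z",
     "sections": [{"_id": "s1", "score": 20.0}, {"_id": "s2", "score": 0.0}]},
    {"_id": "b2", "created_at": "2020-08-28T12:22:32Z", "sections": []},
]

# One row per top-level record, without the nested column.
parents = pd.DataFrame(sample)[["_id", "created_at"]]

# One row per section; records with an empty list contribute no rows here.
sections = pd.json_normalize(
    sample, record_path="sections", meta=["_id"], record_prefix="section_"
)

# Left join from the parents so every _id survives, with NaN section fields
# for records that had no sections.
merged = parents.merge(sections, on="_id", how="left")
print(merged)
```

The left join starts from the full list of IDs, so a record with no sections still ends up with one row. The cost is an extra merge per nesting level.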
This is a known issue with json_normalize. I haven't found a way to do this with json_normalize itself. You could try something like flatten_json:
import flatten_json as fj
import pandas as pd
dic = (fj.flatten(d) for d in sample)
df = pd.DataFrame(dic)
print(df)
_id created_at sections_0_comment ... sections_1__id sections_1_custom_fields sections
0 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z ... 5e59894f68369c2398eb68a9 [] NaN
1 5f48f708fe22ca4d15fb3b55 2020-08-28T12:22:32Z NaN ... NaN NaN []
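If a long (one row per section) layout is preferred over the wide one above, a pandas-only alternative that might be worth trying is DataFrame.explode, which is documented to emit NaN for empty list-likes instead of dropping the row. This is a different technique from flatten_json, sketched here on a shortened, hypothetical `sample`; the variable names are illustrative.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the question's data.
sample = [
    {"_id": "a1", "created_at": "2020-08-28T08:23:00Z",
     "sections": [{"_id": "s1", "score": 20.0}, {"_id": "s2", "score": 0.0}]},
    {"_id": "b2", "created_at": "2020-08-28T12:22:32Z", "sections": []},
]

# explode keeps a row for the empty list and fills it with NaN.
df = pd.DataFrame(sample).explode("sections", ignore_index=True)

# Replace the NaN placeholders with empty dicts so json_normalize accepts them.
section_dicts = [s if isinstance(s, dict) else {} for s in df["sections"]]
sections = pd.json_normalize(section_dicts).add_prefix("section_")

# Both frames share the same 0..n-1 index, so a column-wise concat lines up.
out = pd.concat([df.drop(columns="sections"), sections], axis=1)
print(out)
```

The same two steps could presumably be repeated on the "section_answers" column to reach the answer level.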
- The best way to solve this is to fix the dict
- If sections is an empty list, fill it with [{'answers': [{}]}]
for i, d in enumerate(sample):
    if not d['sections']:
        sample[i]['sections'] = [{'answers': [{}]}]
df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")
# display(df2)
section_comment section_type_fail section_answers section_score section_passed section__id section_custom_fields _id created_at
0 NaN [{'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e59599c68369c24069630fd', 'answer_id': '5e595a7c3fbb70448b6ff935'}, {'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e598939cedcaf5b865ef99a', 'answer_id': '5e598939cedcaf5b865ef998'}] 20.0 True 5e59599c68369c24069630fe [] 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z
1 NaN [{'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e59894f68369c2398eb68a8', 'answer_id': '5eaad4e5b513aed9a3c996a5'}, {'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e598967cedcaf5b865efe3e', 'answer_id': '5eaad4ece3f1e0794372f8b2'}, {'comment': 'stuff', 'feedback': [], 'value': 0.0, 'answer_type': 'default', 'question_id': '5e598976cedcaf5b865effd1', 'answer_id': '5e598976cedcaf5b865effd3'}] 0.0 True 5e59894f68369c2398eb68a9 [] 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z
2 NaN NaN [{}] NaN NaN NaN NaN 5f48f708fe22ca4d15fb3b55 2020-08-28T12:22:32Z
df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")
# display(df3)
comment feedback value answer_type question_id answer_id _id created_at section__id section_score section_passed section_type_fail section_comment
0 stuff [] 10.0 default 5e59599c68369c24069630fd 5e595a7c3fbb70448b6ff935 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59599c68369c24069630fe 20 True NaN
1 stuff [] 10.0 default 5e598939cedcaf5b865ef99a 5e598939cedcaf5b865ef998 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59599c68369c24069630fe 20 True NaN
2 [] NaN not_applicable 5e59894f68369c2398eb68a8 5eaad4e5b513aed9a3c996a5 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
3 [] NaN not_applicable 5e598967cedcaf5b865efe3e 5eaad4ece3f1e0794372f8b2 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
4 stuff [] 0.0 default 5e598976cedcaf5b865effd1 5e598976cedcaf5b865effd3 5f48bee4c54cf6b5e8048274 2020-08-28T08:23:00Z 5e59894f68369c2398eb68a9 0 True NaN
5 NaN NaN NaN NaN NaN NaN 5f48f708fe22ca4d15fb3b55 2020-08-28T12:22:32Z NaN NaN NaN NaN NaN
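The loop above edits `sample` in place and only pads an empty "sections" list. A section whose "answers" list is empty would presumably be dropped at the second step for the same reason. A possible variation, assuming the same key names, is a helper that pads both levels on a copy. `pad_empty` and the shortened `sample` are hypothetical, and the single nested json_normalize call is a swapped-in shortcut for the two-step version above.

```python
import copy
import pandas as pd

def pad_empty(records, sections_key="sections", answers_key="answers"):
    """Return a deep copy in which empty lists hold one empty placeholder."""
    padded = copy.deepcopy(records)
    for rec in padded:
        if not rec.get(sections_key):
            rec[sections_key] = [{answers_key: [{}]}]
        for sec in rec[sections_key]:
            if not sec.get(answers_key):
                sec[answers_key] = [{}]
    return padded

# Shortened, hypothetical stand-in for the question's data.
sample = [
    {"_id": "a1", "sections": [{"_id": "s1", "answers": [{"answer_id": "x"}]},
                               {"_id": "s2", "answers": []}]},
    {"_id": "b2", "sections": []},
]

flat = pd.json_normalize(
    pad_empty(sample),
    record_path=["sections", "answers"],
    meta=["_id", ["sections", "_id"]],
    errors="ignore",  # placeholder sections have no _id, so fill with NaN
)

# Sanity check: every original ID should still be present.
missing = {rec["_id"] for rec in sample} - set(flat["_id"])
print(flat)
print("missing IDs:", missing)
```

With millions of records the deep copy doubles memory use, so the in-place loop may still be the better fit. The final set difference is a cheap way to confirm no ID was lost.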