How to deal with nested JSON in Python and Pandas?
This is an example of one of the JSON objects the API gives me. There are 100 of them.
[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
"type": "line_item_to_deal"}]}},
"properties": {
"createdate": "2020-08-06T15:05:23.253Z",
"description": null,
"hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
"hs_object_id": "133248644",
"name": "test product",
"price": "100"},
"createdAt": "2020-08-06T15:05:23.253Z",
"updatedAt": "2020-08-06T15:05:23.253Z",
"archived": false}]
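I want to create a pandas dataframe that has an id column and all of the properties associated with it, plus the id nested under associations. Essentially, I want to remove the nesting of the properties under properties and of the id under associations (and rename it). How can I do this?
Here is a reproducible example of my attempts at solving the problem: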
import json
import pandas as pd
response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
"type": "line_item_to_deal"}]}},
"properties": {
"createdate": "2020-08-06T15:05:23.253Z",
"description": null,
"hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
"hs_object_id": "133248644",
"name": "test product",
"price": "100"},
"createdAt": "2020-08-06T15:05:23.253Z",
"updatedAt": "2020-08-06T15:05:23.253Z",
"archived": false},
{"id": "133345685",
"associations": {"deals": {"results": [{"id": "2762673038",
"type": "line_item_to_deal"}]}},
"properties": {
"createdate":
"2020-08-06T18:29:06.773Z",
"description": null,
"hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
"hs_object_id": "133345685",
"name": "TEST PRODUCT 2",
"price": "2222"},
"createdAt": "2020-08-06T18:29:06.773Z",
"updatedAt": "2020-08-06T18:29:06.773Z",
"archived": false}]"""
data = json.loads(response)
data_flat = [dict(id=x["id"], **x["properties"]) for x in data]
This is a better solution, but still not perfect.
data_flat = [dict(lineid=x["id"],dealid=x["associations"]["deals"]["results"][0]["id"], **x["properties"]) for x in data]
Finally, this is quite useful, but it still requires me to extract the id from the associations column in a convoluted way.
normal_data = pd.json_normalize(data)
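A note on the second attempt: indexing results[0] assumes every item has exactly one associated deal. The following is a minimal sketch, not a definitive implementation, of the same plain-Python flattening that tolerates several deals or none. The helper name flatten and the two trimmed records are hypothetical and only illustrate the shape of the API response; lineid and dealid are the column names used in the attempt above.

```python
import pandas as pd

# Hypothetical, trimmed records in the same shape as the API response.
data = [
    {
        "id": "1",
        "associations": {"deals": {"results": [
            {"id": "10", "type": "line_item_to_deal"},
            {"id": "11", "type": "line_item_to_deal"},
        ]}},
        "properties": {"name": "a", "price": "100"},
    },
    # An item with no "associations" key at all.
    {"id": "2", "properties": {"name": "b", "price": "5"}},
]


def flatten(item):
    """Yield one flat dict per associated deal, or one with dealid=None."""
    results = item.get("associations", {}).get("deals", {}).get("results", [])
    deal_ids = [r["id"] for r in results] or [None]
    for deal_id in deal_ids:
        yield {"lineid": item["id"], "dealid": deal_id, **item["properties"]}


# One row per (line item, deal) pair.
df = pd.DataFrame([row for item in data for row in flatten(item)])
```

Each line item is repeated once per associated deal, which mirrors what explode does in a dataframe-based approach.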
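- Working with a list of nested dicts is complicated. There is no readable one-liner for extracting the data.
- Read data with pandas.json_normalize.
  - associations.deals.results is a list of dicts, so use pandas.DataFrame.explode to separate each dict in the list into its own row.
- Use .json_normalize on associations.deals.results to convert the dicts to columns.
  - Use pandas.DataFrame.join to join df to the normalized columns.
  - id already exists in the dataframe, so the id from the dict gets the suffix given by rsuffix, but type gets no suffix because it does not exist in df.
- Use pandas.DataFrame.rename to rename any columns as needed.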
import pandas as pd
import json
# convert response from a string to a list of dicts
data = json.loads(response)
# create a pandas dataframe
df = pd.json_normalize(data)
# associations.deals.results is a list of dicts, explode them
df = df.explode('associations.deals.results').reset_index(drop=True)
# normalize the dicts in associations.deals.results and join them back to df
df = df.join(pd.json_normalize(df['associations.deals.results']), rsuffix='.associations.deals.results').drop(columns=['associations.deals.results'])
# display(df)
id createdAt updatedAt archived properties.createdate properties.description properties.hs_lastmodifieddate properties.hs_object_id properties.name properties.price id.associations.deals.results type
0 133248644 2020-08-06T15:05:23.253Z 2020-08-06T15:05:23.253Z False 2020-08-06T15:05:23.253Z None 2020-08-06T15:05:23.253Z 133248644 test product 100 2762673039 line_item_to_deal
1 133345685 2020-08-06T18:29:06.773Z 2020-08-06T18:29:06.773Z False 2020-08-06T18:29:06.773Z None 2020-08-06T18:29:06.773Z 133345685 TEST PRODUCT 2 2222 2762673038 line_item_to_deal
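The final rename step is not shown in the code above. The following is a minimal sketch, assuming the column names from the output above. The small dataframe is an abbreviated stand-in for the real result, and lineid and dealid are the names used in the question's own attempt rather than anything pandas requires.

```python
import pandas as pd

# Abbreviated stand-in for the flattened dataframe shown above.
df = pd.DataFrame({
    "id": ["133248644", "133345685"],
    "properties.name": ["test product", "TEST PRODUCT 2"],
    "properties.price": ["100", "2222"],
    "id.associations.deals.results": ["2762673039", "2762673038"],
    "type": ["line_item_to_deal", "line_item_to_deal"],
})

# Give the two id columns explicit names.
df = df.rename(columns={"id": "lineid", "id.associations.deals.results": "dealid"})

# Drop the "properties." prefix that json_normalize adds to nested keys.
prefix = "properties."
df.columns = [c[len(prefix):] if c.startswith(prefix) else c for c in df.columns]
```

Stripping the prefix is safe here only because no property shares a name with a top-level column after the rename. If one did, the resulting dataframe would have duplicate column names.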
Response
response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
"type": "line_item_to_deal"}]}},
"properties": {
"createdate": "2020-08-06T15:05:23.253Z",
"description": null,
"hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
"hs_object_id": "133248644",
"name": "test product",
"price": "100"},
"createdAt": "2020-08-06T15:05:23.253Z",
"updatedAt": "2020-08-06T15:05:23.253Z",
"archived": false},
{"id": "133345685",
"associations": {"deals": {"results": [{"id": "2762673038",
"type": "line_item_to_deal"}]}},
"properties": {
"createdate":
"2020-08-06T18:29:06.773Z",
"description": null,
"hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
"hs_object_id": "133345685",
"name": "TEST PRODUCT 2",
"price": "2222"},
"createdAt": "2020-08-06T18:29:06.773Z",
"updatedAt": "2020-08-06T18:29:06.773Z",
"archived": false}]"""