How to deal with nested JSON in Python and Pandas?

Question

This is an example of one of the JSON objects the API gives me. There are 100 of them.

[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}]

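I want to create a pandas DataFrame that has a column for id and all of the properties associated with it, in addition to the id nested under "associations". Essentially, I want to remove the nesting of the properties under properties and of the id under associations (and rename that id). How would I go about doing that?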

Here is a reproducible example of my attempts to solve the problem:

import json
import pandas as pd

response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}, 
{"id": "133345685",
 "associations": {"deals": {"results": [{"id": "2762673038",
                                          "type": "line_item_to_deal"}]}},
 "properties": {
             "createdate": 
             "2020-08-06T18:29:06.773Z", 
             "description": null,
             "hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
             "hs_object_id": "133345685",
             "name": "TEST PRODUCT 2",
             "price": "2222"},
 "createdAt": "2020-08-06T18:29:06.773Z", 
 "updatedAt": "2020-08-06T18:29:06.773Z",
 "archived": false}]"""


data = json.loads(response)
data_flat = [dict(id=x["id"], **x["properties"]) for x in data]

This is a better solution, but still not ideal.

data_flat = [dict(lineid=x["id"],dealid=x["associations"]["deals"]["results"][0]["id"], **x["properties"]) for x in data]

Finally, this is very useful, but it would still require me to extract the id from the associations column in a convoluted way.

normal_data = pd.json_normalize(data)

Answer 1

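Dealing with a list of nested dicts is complicated. There is no readable one-liner for extracting the data.
- Read data with pandas.json_normalize.
- associations.deals.results is a list of dicts; use pandas.DataFrame.explode to separate each dict in the list onto its own row.
- Use .json_normalize on associations.deals.results to convert the dicts to columns.
- Use pandas.DataFrame.join to join df to the normalized columns.
  - id already exists in df, so the id from the dicts gets the rsuffix; type needs no suffix because it is not already in df.
- Use pandas.DataFrame.rename to rename any columns as needed.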

import pandas as pd
import json

# convert response from a string to a list of dicts
data = json.loads(response)

# create a pandas dataframe
df = pd.json_normalize(data)

# associations.deals.results is a list of dicts, explode them
df = df.explode('associations.deals.results').reset_index(drop=True)

# normalize the dicts in associations.deals.results and join them back to df
df = df.join(pd.json_normalize(df['associations.deals.results']), rsuffix='.associations.deals.results').drop(columns=['associations.deals.results'])

# display(df)
          id                 createdAt                 updatedAt  archived     properties.createdate properties.description properties.hs_lastmodifieddate properties.hs_object_id properties.name properties.price id.associations.deals.results               type
0  133248644  2020-08-06T15:05:23.253Z  2020-08-06T15:05:23.253Z     False  2020-08-06T15:05:23.253Z                   None       2020-08-06T15:05:23.253Z               133248644    test product              100                    2762673039  line_item_to_deal
1  133345685  2020-08-06T18:29:06.773Z  2020-08-06T18:29:06.773Z     False  2020-08-06T18:29:06.773Z                   None       2020-08-06T18:29:06.773Z               133345685  TEST PRODUCT 2             2222                    2762673038  line_item_to_deal
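The rename step from the list above is not shown in the code, so here is a minimal sketch of it. It uses a small stand-in frame with a subset of the columns from the output above, and borrows the names lineid and dealid from the question; stripping the "properties." prefix with a regular expression is one possible approach, not the only one.

```python
import pandas as pd

# Stand-in for the joined frame above (subset of its columns, same names)
df = pd.DataFrame({
    "id": ["133248644", "133345685"],
    "properties.name": ["test product", "TEST PRODUCT 2"],
    "properties.price": ["100", "2222"],
    "id.associations.deals.results": ["2762673039", "2762673038"],
    "type": ["line_item_to_deal", "line_item_to_deal"],
})

# Rename the two id columns explicitly (names taken from the question)
df = df.rename(columns={
    "id": "lineid",
    "id.associations.deals.results": "dealid",
})

# Drop the leading "properties." prefix from the remaining column names
df.columns = df.columns.str.replace(r"^properties\.", "", regex=True)

print(df.columns.tolist())
```

Doing the explicit rename first keeps the two id columns distinguishable before the prefix is removed.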

Response

response = """[{"id": "133248644",
"associations": {"deals": {"results": [{"id": "2762673039",
                                          "type": "line_item_to_deal"}]}},
  "properties": {
            "createdate": "2020-08-06T15:05:23.253Z",
            "description": null,
            "hs_lastmodifieddate": "2020-08-06T15:05:23.253Z",
            "hs_object_id": "133248644",
            "name": "test product",
            "price": "100"},
 "createdAt": "2020-08-06T15:05:23.253Z",
 "updatedAt": "2020-08-06T15:05:23.253Z",
 "archived": false}, 
{"id": "133345685",
 "associations": {"deals": {"results": [{"id": "2762673038",
                                          "type": "line_item_to_deal"}]}},
 "properties": {
             "createdate": 
             "2020-08-06T18:29:06.773Z", 
             "description": null,
             "hs_lastmodifieddate": "2020-08-06T18:29:06.773Z",
             "hs_object_id": "133345685",
             "name": "TEST PRODUCT 2",
             "price": "2222"},
 "createdAt": "2020-08-06T18:29:06.773Z", 
 "updatedAt": "2020-08-06T18:29:06.773Z",
 "archived": false}]"""
