Read nested JSON into Pandas DataFrame

Question

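Background information -
I have a JSON response from an API call that I am trying to save in a pandas DataFrame, while keeping the same structure the data has when I view it in the system I am calling.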

Function that calls for the JSON response -
def api_call(): calls the API (note: url_list currently contains only 1x url) and uses json.loads(response.text) to save the response in the api_response variable.

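import json
import requests
from requests.auth import HTTPBasicAuth

def api_call():
    # url_constructor(), key and secret are defined elsewhere in my script
    url_list = url_constructor()
    for url in url_list:
        response = requests.get(url, auth=HTTPBasicAuth(key, secret), headers={"Firm": "583"})
    # Only the last response is parsed; url_list currently holds a single url
    api_response = json.loads(response.text)
    return api_response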

Function that saves the response to a file and also returns it:
def response_writer(): saves api_response as a JSON file. It also returns api_response.

import datetime
import json

def response_writer():
    api_response = api_call()
    timestr = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")
    filename = 'api_response_'+timestr+'.json'
    with open(filename, 'w') as output_data:
        json.dump(api_response, output_data)
        print("-------------------------------------------------------\n", 
              "API RESPONSE SAVED:", filename, "\n-------------------------------------------------------")
    return api_response

JSON response -

{
  "meta": {
    "columns": [
      {
        "key": "node_id",
        "display_name": "Entity ID",
        "output_type": "Word"
      },
      {
        "key": "bottom_level_holding_account_number",
        "display_name": "Holding Account Number",
        "output_type": "Word"
      },
      {
        "key": "value",
        "display_name": "Adjusted Value (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "node_ownership",
        "display_name": "% Ownership",
        "output_type": "Percent"
      },
      {
        "key": "model_type",
        "display_name": "Model Type",
        "output_type": "Word"
      },
      {
        "key": "valuation",
        "display_name": "Valuation (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "_custom_jb_custodian_305769",
        "display_name": "JB Custodian",
        "output_type": "Word"
      },
      {
        "key": "top_level_owner",
        "display_name": "Top Level Owner",
        "output_type": "Word"
      },
      {
        "key": "top_level_legal_entity",
        "display_name": "Top Level Legal Entity",
        "output_type": "Word"
      },
      {
        "key": "direct_owner",
        "display_name": "Direct Owner",
        "output_type": "Word"
      },
      {
        "key": "online_status",
        "display_name": "Online Status",
        "output_type": "Word"
      },
      {
        "key": "financial_service",
        "display_name": "Financial Service",
        "output_type": "Word"
      },
      {
        "key": "_custom_placeholder_461415",
        "display_name": "Placeholder or Fee Basis",
        "output_type": "Boolean"
      },
      {
        "key": "_custom_close_date_411160",
        "display_name": "Account Close Date",
        "output_type": "Date"
      },
      {
        "key": "_custom_ownership_audit_note_425843",
        "display_name": "Ownership Audit Note",
        "output_type": "Word"
      }
    ],
    "groupings": [
      {
        "key": "holding_account",
        "display_name": "Holding Account"
      }
    ]
  },
  "data": {
    "type": "portfolio_views",
    "attributes": {
      "total": {
        "name": "Total",
        "columns": {
          "direct_owner": null,
          "node_ownership": null,
          "online_status": null,
          "_custom_ownership_audit_note_425843": null,
          "model_type": null,
          "_custom_placeholder_461415": null,
          "top_level_owner": null,
          "_custom_close_date_411160": null,
          "valuation": null,
          "bottom_level_holding_account_number": null,
          "_custom_jb_custodian_305769": null,
          "financial_service": null,
          "top_level_legal_entity": null,
          "value": null,
          "node_id": null
        },
        "children": [
          {
            "entity_id": 4754837,
            "name": "Apple Holdings Adv (748374923)",
            "grouping": "holding_account",
            "columns": {
              "direct_owner": "Apple Holdings LLC",
              "node_ownership": 1,
              "online_status": "Online",
              "_custom_ownership_audit_note_425843": null,
              "model_type": "Holding Account",
              "_custom_placeholder_461415": false,
              "top_level_owner": "Forsyth Family",
              "_custom_close_date_411160": null,
              "valuation": 10423695.609450001,
              "bottom_level_holding_account_number": "748374923",
              "_custom_jb_custodian_305769": "Laverockbank",
              "financial_service": "laverockbankcustodianservice",
              "top_level_legal_entity": "Apple Holdings LLC",
              "value": 10423695.609450001,
              "node_id": "4754837"
            },
          }
        ]
      }
    }
  },
  "included": []
}

Expected structure of the JSON in the Pandas DataFrame -
This is the structure I am trying to reproduce in my pandas DataFrame -

| Holding Account                 | Entity ID | Holding Account Number | Adjusted Value (USD) | % Ownership | Model Type      | Valuation (USD) | JB Custodian | Top Level Owner | Top Level Legal Entity          | Direct Owner                    | Online Status | Financial Service   | Placeholder or Fee Basis | Account Close Date | Ownership Audit Note |
|---------------------------------|-----------|------------------------|----------------------|-------------|-----------------|-----------------|--------------|-----------------|---------------------------------|---------------------------------|---------------|---------------------|--------------------------|--------------------|----------------------|
| Apple Holdings Adv (748374923)  | 4754837   | 748374923              | $10,423,695.06       | 100.00%     | Holding Account | $10,423,695.06  | BRF          | Forsyth Family  | Apple Holdings Partners LLC     | Apple Holdings Partners LLC     | Online        | custodianservice    | No                       | -                  | -                    |

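My interpretation of the JSON structure -
It looks like I need to focus on 'columns' (which holds the column headers) and on the 'children' under 'data' (which represent the rows of data; in my case, just 1x row). I can ignore 'groupings': [{'key': 'holding_account', 'display_name': 'Holding Account'}], because that is ultimately just how the data is sorted in the system.

Does anyone have advice on how I can take the JSON and load it into a DataFrame with the structure shown?

My interpretation is that I need to set the display_names [columns] as the headers and then map the corresponding children values under each of the respective display_names / headers. Note: normally there would be more children (each one representing a row of my DataFrame), but I have stripped out all but 1x to make this easier to explain.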

Answer 1

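I'm not sure this is the best way to unpack the dictionary, but it works:
(it keeps the child "metadata", such as the id (which is duplicated) and the full holding account name)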

def unpack_dict(item, out):
    # Recursively copy every non-dict value of item into the flat dict out
    for k, v in item.items():
        if isinstance(v, dict):
            unpack_dict(v, out)
        else:
            out[k] = v
    return out
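
To make the behavior concrete, here is a minimal sketch that runs unpack_dict on a shortened copy of the child from the question (the sample dict is trimmed for brevity and is not the full response). Nested keys are hoisted to the top level, so the "columns" wrapper disappears.

```python
def unpack_dict(item, out):
    """Recursively copy all non-dict values of `item` into the flat dict `out`."""
    for k, v in item.items():
        if isinstance(v, dict):
            unpack_dict(v, out)
        else:
            out[k] = v
    return out

# Shortened copy of one child from the question's JSON response
child = {
    "entity_id": 4754837,
    "name": "Apple Holdings Adv (748374923)",
    "grouping": "holding_account",
    "columns": {"direct_owner": "Apple Holdings LLC", "node_id": "4754837"},
}

flat = unpack_dict(child, {})
print(list(flat))
# ['entity_id', 'name', 'grouping', 'direct_owner', 'node_id']
```

One caveat of this approach: if a nested dict reuses a key that already exists at an outer level, the later value overwrites the earlier one.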

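Now we need to use it on each child to get the data.

From your example, it looks like you want to keep the Holding Account (from the child), but you don't want entity_id because it is duplicated in node_id?

I'm not sure, so I am simply including all columns with their "raw" names.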

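# res is the parsed API response (the dict returned by api_call())
children = res["data"]["attributes"]["total"]["children"]
# Column names are the flattened keys of the first child
columns = list(unpack_dict(children[0], {}))
data = []

for child in children:
    data.append(list(unpack_dict(child, {}).values()))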

And create the DataFrame from it:

>>> pd.DataFrame(data=data, columns = columns)
   entity_id                            name  ...         value  node_id
0    4754837  Apple Holdings Adv (748374923)  ...  1.042370e+07  4754837

[1 rows x 18 columns]

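You can now change this to use the display names instead of these raw names. You may need to drop some columns, though: as I mentioned above, the id is duplicated, you mentioned the groupings, and so on.

If you are working with a lot of data (thousands of entries) and parsing takes a long time, you can drop the extra columns before appending to data to save time later.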
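
One way to do that filtering, as a sketch: skip the unwanted keys while building each row. The names unwanted and build_row are illustrative and not part of the code above, the choice of keys to drop is an assumption, and the sample child is a shortened copy of the one in the question. It also assumes a single level of nesting under "columns".

```python
import pandas as pd

# Keys to leave out of the DataFrame (an assumption; adjust to your needs)
unwanted = {"entity_id", "grouping"}

def build_row(child, unwanted):
    """Flatten one child: top-level fields plus its "columns" dict, minus unwanted keys."""
    row = {k: v for k, v in child.items() if k != "columns" and k not in unwanted}
    row.update({k: v for k, v in child.get("columns", {}).items() if k not in unwanted})
    return row

# Shortened copy of the children list from the question's JSON response
children = [
    {
        "entity_id": 4754837,
        "name": "Apple Holdings Adv (748374923)",
        "grouping": "holding_account",
        "columns": {"direct_owner": "Apple Holdings LLC", "node_id": "4754837"},
    }
]

df = pd.DataFrame([build_row(child, unwanted) for child in children])
print(df.columns.tolist())
# ['name', 'direct_owner', 'node_id']
```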

Rename the columns using a dict:

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
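
Note that DataFrame.rename returns a new DataFrame by default, so the result has to be assigned back (or inplace=True passed). A small sketch, using two of the raw names and display names from the question:

```python
import pandas as pd

df = pd.DataFrame({"node_id": ["4754837"], "direct_owner": ["Apple Holdings LLC"]})

# rename() leaves df untouched unless inplace=True is passed; keep the result
renamed = df.rename(columns={"node_id": "Entity ID", "direct_owner": "Direct Owner"})

print(df.columns.tolist())       # ['node_id', 'direct_owner']
print(renamed.columns.tolist())  # ['Entity ID', 'Direct Owner']
```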

Answer 2

I suggest using pd.json_normalize() ( https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html ), which helps convert JSON data into a pandas DataFrame.

Note 1: Below I assume the data is available in a Python dictionary named data. For testing purposes, I used

import json
json_data = '''
{
  "meta": {
      # ....
  },
  #...
  "included": []
}
'''
data = json.loads(json_data)

where json_data is your JSON response. Since json.loads() does not accept a trailing comma, I omitted the comma after the children object.
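
The trailing comma problem is easy to reproduce. In this sketch the two strings are tiny stand-ins for the real response, not the actual payload. If the real API response already parses with json.loads (as api_call in the question does), the comma is most likely just an artifact of trimming the example.

```python
import json

# Tiny stand-ins for the response, with and without the trailing comma
with_trailing_comma = '{"children": [{"columns": {"node_id": "4754837"},}]}'
without_trailing_comma = '{"children": [{"columns": {"node_id": "4754837"}}]}'

try:
    json.loads(with_trailing_comma)
    parsed_ok = True
except json.JSONDecodeError as exc:
    # Standard JSON does not allow a comma before a closing brace
    parsed_ok = False
    print("Invalid JSON:", exc)

data = json.loads(without_trailing_comma)
print(data["children"][0]["columns"]["node_id"])  # 4754837
```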

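pd.json_normalize() offers several options. One possibility is to simply read all of the "children" data and then drop the columns that are not needed. Also, after normalization some columns carry the prefix "columns.", which needs to be removed.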

import pandas as pd
df = pd.json_normalize(data['data']['attributes']['total']['children'])
df.drop(columns=['grouping', 'entity_id'], inplace=True)
# regex=False treats 'columns.' as a literal prefix rather than a pattern
df.columns = df.columns.str.replace('columns.', '', regex=False)

Finally, the column names need to be replaced with the names from the "columns" data:

column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
df.rename(columns=column_name_mapper, inplace=True)

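Note 2: There are some minor deviations from the expected structure you described. Most notably, the word 'name' in the DataFrame header (with the row value "Apple Holdings Adv (748374923)") is not changed to 'Holding Account', because neither term appears in the columns list. Some other values simply differ between the JSON response you posted and the expected structure.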
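
If the 'name' column should also carry the 'Holding Account' header, one option is to take the label from meta.groupings. This is only a sketch, under the assumption that the first grouping describes the 'name' field of the children. The question's data suggests this (each child has "grouping": "holding_account"), but the API documentation would need to confirm it. The data dict below is a cut-down stand-in for the full response.

```python
import pandas as pd

# Cut-down stand-in for the full API response
data = {
    "meta": {
        "columns": [{"key": "node_id", "display_name": "Entity ID"}],
        "groupings": [{"key": "holding_account", "display_name": "Holding Account"}],
    },
    "data": {"attributes": {"total": {"children": [
        {
            "entity_id": 4754837,
            "name": "Apple Holdings Adv (748374923)",
            "grouping": "holding_account",
            "columns": {"node_id": "4754837"},
        }
    ]}}},
}

df = pd.json_normalize(data["data"]["attributes"]["total"]["children"])
df = df.drop(columns=["grouping", "entity_id"])
df.columns = df.columns.str.replace("columns.", "", regex=False)

# key -> display_name for the regular columns, plus the grouping label for 'name'
mapper = {c["key"]: c["display_name"] for c in data["meta"]["columns"]}
mapper["name"] = data["meta"]["groupings"][0]["display_name"]  # assumption, see above
df = df.rename(columns=mapper)

print(df.columns.tolist())
# ['Holding Account', 'Entity ID']
```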
