将嵌套的 JSON 转换为 Python 中的 CSV 文件

Convert nested JSON to CSV file in Python

我知道这个问题已经被问过很多次了。我尝试了多种解决方案,但无法解决我的问题。

我有一个很大的嵌套 JSON 文件 (1.4GB),我想将其扁平化,然后将其转换为 CSV 文件。

JSON结构是这样的:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

我知道这很容易用 pandas 模块完成,但我不熟悉它。

已编辑

所需的输出应该是这样的:

company_number, address_line_1, locality, country_of_residence, kind,

12345678, Address 1, Henley-On-Thamed, England, individual-person-with-significant-control

请注意,这只是简短版本。输出应该包含所有字段。

对于您提供的 JSON 数据,您可以通过将 JSON 结构解析为仅 return 所有叶节点的列表来实现。

这假定您的结构始终是一致的,如果每个条目可以有不同的字段,请参阅第二种方法。

例如:

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves
    else:
        return [(key, item)]


with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    write_header = True

    for entry in json.load(f_input):
        leaf_entries = sorted(get_leaves(entry))

        if write_header:
            csv_output.writerow([k for k, v in leaf_entries])
            write_header = False

        csv_output.writerow([v for k, v in leaf_entries])

如果您的 JSON 数据是您指定格式的条目列表,那么您应该得到如下输出:

address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977

如果每个条目可以包含不同的(或可能缺失的)字段,那么更好的方法是使用 DictWriter。在这种情况下,需要处理所有条目以确定可能的 fieldnames 的完整列表,以便可以写入正确的 header。

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        return {key : item}


with open('json.txt') as f_input:
    json_data = json.load(f_input)

# First parse all entries to get the complete fieldname list
fieldnames = set()

for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)

您可以使用pandas库json_normalize函数将struct展平,然后随意处理。例如:

import pandas as pd
import json

raw = """[{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}]"""

data = json.loads(raw)
data = pd.json_normalize(data)
print(data.to_csv())

这给你:

,company_number,data.address.address_line_1,data.address.locality,data.address.postal_code,data.address.premises,data.address.region,data.country_of_residence,data.date_of_birth.month,data.date_of_birth.year,data.etag,data.kind,data.links.self,data.name,data.name_elements.forename,data.name_elements.middle_name,data.name_elements.surname,data.name_elements.title,data.nationality,data.natures_of_control,data.notified_on
0,12345678,Address 1,Henley-On-Thames,RG9 1DP,161,Oxfordshire,England,2,1977,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,individual-person-with-significant-control,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,John M Smith,John,M,Smith,Mrs,Vietnamese,['ownership-of-shares-50-to-75-percent'],2016-04-06

请向下滚动以获得更新、更快的解决方案

这是一个较老的问题,但我为类似的情况苦苦挣扎了一整晚才得到满意的结果,我想出了这个:

import json
import pandas

def cross_join(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop('key', 1)

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=None):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)

    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')

解释:

cross_join 函数是我发现做笛卡尔积的一种巧妙方法。 (信用:here

json_to_dataframe 函数执行逻辑,使用 pandas 数据帧。在我的例子中,json 嵌套很深,我想将字典 key:value 对拆分为列 ,但我想将 列表转换为列 的行——因此是 concat——然后我将其与上层交叉连接,从而乘以记录数,以便列表中的每个值都有自己的行,而前面的列是相同的。

递归创建与下面的交叉连接的堆栈,直到返回最后一个。

然后使用 table 格式的数据框,使用 "df.to_csv()" 数据框很容易转换为 CSV object方法。

这应该适用于深层嵌套 JSON,能够通过上述逻辑将所有内容规范化为行。

我希望有一天这会对某人有所帮助。只是想回馈这个很棒的社区。

---------------------------------------- ---------------------------------------------- ---

稍后编辑:新解决方案

我回到这个问题上,因为虽然数据框选项有点工作,但应用程序花了几分钟时间来解析不太大的 JSON 数据。因此我想做数据框做的事情,但我自己做:

from copy import deepcopy
import pandas


def cross_join(left, right):
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows


def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem


def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for item in data:
                [rows.append(elem) for elem in flatten_list(flatten_json(item, prev_heading))]
        else:
            rows = [{prev_heading[1:]: data}]
        return rows

    return pandas.DataFrame(flatten_json(data_in))


if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        {"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"},
                        {"id": "1003", "type": "Blueberry"},
                        {"id": "1004", "type": "Devil's Food"}
                    ]
            },
        "topping":
            [
                {"id": "5001", "type": "None"},
                {"id": "5002", "type": "Glazed"},
                {"id": "5005", "type": "Sugar"},
                {"id": "5007", "type": "Powdered Sugar"},
                {"id": "5006", "type": "Chocolate with Sprinkles"},
                {"id": "5003", "type": "Chocolate"},
                {"id": "5004", "type": "Maple"}
            ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)

输出:

      id   type  name   ppu batters.batter.id batters.batter.type topping.id              topping.type
0   0001  donut  Cake  0.55              1001             Regular       5001                      None
1   0001  donut  Cake  0.55              1001             Regular       5002                    Glazed
2   0001  donut  Cake  0.55              1001             Regular       5005                     Sugar
3   0001  donut  Cake  0.55              1001             Regular       5007            Powdered Sugar
4   0001  donut  Cake  0.55              1001             Regular       5006  Chocolate with Sprinkles
5   0001  donut  Cake  0.55              1001             Regular       5003                 Chocolate
6   0001  donut  Cake  0.55              1001             Regular       5004                     Maple
7   0001  donut  Cake  0.55              1002           Chocolate       5001                      None
8   0001  donut  Cake  0.55              1002           Chocolate       5002                    Glazed
9   0001  donut  Cake  0.55              1002           Chocolate       5005                     Sugar
10  0001  donut  Cake  0.55              1002           Chocolate       5007            Powdered Sugar
11  0001  donut  Cake  0.55              1002           Chocolate       5006  Chocolate with Sprinkles
12  0001  donut  Cake  0.55              1002           Chocolate       5003                 Chocolate
13  0001  donut  Cake  0.55              1002           Chocolate       5004                     Maple
14  0001  donut  Cake  0.55              1003           Blueberry       5001                      None
15  0001  donut  Cake  0.55              1003           Blueberry       5002                    Glazed
16  0001  donut  Cake  0.55              1003           Blueberry       5005                     Sugar
17  0001  donut  Cake  0.55              1003           Blueberry       5007            Powdered Sugar
18  0001  donut  Cake  0.55              1003           Blueberry       5006  Chocolate with Sprinkles
19  0001  donut  Cake  0.55              1003           Blueberry       5003                 Chocolate
20  0001  donut  Cake  0.55              1003           Blueberry       5004                     Maple
21  0001  donut  Cake  0.55              1004        Devil's Food       5001                      None
22  0001  donut  Cake  0.55              1004        Devil's Food       5002                    Glazed
23  0001  donut  Cake  0.55              1004        Devil's Food       5005                     Sugar
24  0001  donut  Cake  0.55              1004        Devil's Food       5007            Powdered Sugar
25  0001  donut  Cake  0.55              1004        Devil's Food       5006  Chocolate with Sprinkles
26  0001  donut  Cake  0.55              1004        Devil's Food       5003                 Chocolate
27  0001  donut  Cake  0.55              1004        Devil's Food       5004                     Maple

根据上面的内容,cross_join 函数与数据框解决方案中的功能几乎相同,但没有数据框,因此速度更快.

我添加了 flatten_list 生成器,因为我想确保 JSON 数组都很好且扁平化,然后作为单个列表提供在分配给列表的每个值之前,由一次迭代中的前一个键组成的字典。这几乎模仿了这种情况下的 pandas.concat 行为。

main函数中的逻辑,json_to_dataframe就和之前一样了。所有需要改变的是将数据帧执行的操作作为编码函数。

此外,在数据框解决方案中,我没有将前一个标题附加到嵌套的 object,但除非您 100% 确定列名没有冲突,否则这几乎是强制性的。

希望对您有所帮助:)。

EDIT:修改了cross_join函数来处理嵌套列表为空的情况,基本保持先前的结果集未修改。即使在示例 JSON 数据中添加空 JSON 列表后,输出也没有改变。谢谢 @Nazmus Sakib 指出。

Referring to the answer of Bogdan Mircea,

代码几乎达到了我的目的! 但它 returns 每当它在嵌套 json.

中遇到空列表时,它就是一个空数据框

您可以通过将其放入代码中轻松解决此问题

elif isinstance(data, list):
        rows = []
        if(len(data) != 0):
            for i in range(len(data)):
                [rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))]
        else:
            data.append(None)
            [rows.append(elem) for elem in flatten_list(flatten_json(data[0], prev_heading))]