Opening a large JSON file and converting it to CSV
I am trying to convert a large JSON file (4.35 GB) to CSV.
My initial approach was to import it, convert it to a dataframe (I only need what's in features), do some data manipulation, and then export it to CSV:
import json
import pandas as pd
from pandas import json_normalize

with open('Risk_of_Flooding_from_Rivers_and_Sea.json') as data_file:
    d = json.load(data_file)

# Grabbing the data in 'features'.
json_df = json_normalize(d, 'features')
df = pd.DataFrame(json_df)
I have successfully done this on small samples of the dataset, but I can't import the whole dataset at once, even after leaving it running for 9 hours. Even though no error is raised, I assume it is a memory issue, despite my PC having 16 GB of RAM.
This is a small sample of the JSON data I am working with:
{
"type": "FeatureCollection",
"crs": {
"type": "name",
"properties": {
"name": "EPSG:27700"
}
},
"features": [
{
"type": "Feature",
"id": 1,
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
289344.50009999985,
60397.26009999961
],
[
289347.2400000002,
60400
]
]
]
},
"properties": {
"OBJECTID": 1,
"prob_4band": "Low",
"suitability": "National to County",
"pub_date": 1522195200000,
"shape_Length": 112.16436096255808,
"shape_Area": 353.4856092588217
}
},
{
"type": "Feature",
"id": 2,
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
289250,
60550
],
[
289200,
60550
]
]
]
},
"properties": {
"OBJECTID": 2,
"prob_4band": "Very Low",
"suitability": "National to County",
"pub_date": 1522195200000,
"shape_Length": 985.6295076665662,
"shape_Area": 18755.1377842949
}
},
I have considered splitting the JSON file into smaller chunks, but my attempts have not been successful. With the code below I get the error
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1).
import json
import os

with open(os.path.join('E:/Jupyter', 'Risk_of_Flooding_from_Rivers_and_Sea.json'), 'r',
          encoding='utf-8') as f1:
    # Assumes one JSON object per line, which this file does not have.
    ll = [json.loads(line.strip()) for line in f1.readlines()]
    print(len(ll))

size_of_the_split = 10000
total = len(ll) // size_of_the_split
print(total + 1)

for i in range(total + 1):
    with open('E:/Jupyter/split' + str(i + 1) + '.json', 'w',
              encoding='utf-8') as f2:
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split],
                  f2, ensure_ascii=False, indent=True)
I'm just wondering what my options are. Is the way I'm doing it the best way, and if so, what can I change? I am getting the smaller samples from this source, but they can't be too big.
To split the data you can use a streaming parser such as ijson. (Your line-by-line attempt fails because the file is a single pretty-printed JSON document, not one JSON object per line, so json.loads chokes on the lone { of the first line.) For example:
import ijson
import itertools
import json

chunk_size = 10_000
filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

with open(filename, mode='rb') as file_in:
    # Stream each element of the top-level 'features' array without
    # loading the whole document into memory.
    features = ijson.items(file_in, 'features.item', use_float=True)
    chunk = list(itertools.islice(features, chunk_size))
    count = 1
    while chunk:
        with open(f'features-split-{count}.json', mode='w') as file_out:
            json.dump(chunk, file_out, ensure_ascii=False, indent=4)
        chunk = list(itertools.islice(features, chunk_size))
        count += 1
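Each split file is then small enough to load with your original json_normalize approach, e.g.:

import json
import pandas as pd

with open('features-split-1.json') as f:
    chunk_df = pd.json_normalize(json.load(f))
chunk_df.to_csv('features-split-1.csv', index=False)

Alternatively, since the end goal is CSV, you could skip the intermediate JSON files and stream each feature's properties straight into a CSV writer. A minimal sketch, assuming the hypothetical output name flood_risk.csv and that you only need the properties columns shown in your sample:

import csv
import ijson

filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

# Columns taken from the sample's 'properties'; adjust to what you need.
fieldnames = ['OBJECTID', 'prob_4band', 'suitability',
              'pub_date', 'shape_Length', 'shape_Area']

with open(filename, mode='rb') as file_in, \
        open('flood_risk.csv', mode='w', newline='') as file_out:
    writer = csv.DictWriter(file_out, fieldnames=fieldnames)
    writer.writeheader()
    # Stream one feature at a time, so memory use stays flat.
    for feature in ijson.items(file_in, 'features.item', use_float=True):
        writer.writerow(feature['properties'])

Only one feature is held in memory at a time, so this should stay comfortably within 16 GB of RAM.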