Opening a large JSON file and converting it to CSV
I am trying to convert a large JSON file (4.35 GB) to CSV.
My initial approach was to import it, convert it to a dataframe (I only need what's in features), do some data manipulation, and then export it to CSV:
import json
import pandas as pd
from pandas import json_normalize

with open('Risk_of_Flooding_from_Rivers_and_Sea.json') as data_file:
    d = json.load(data_file)

# Grabbing the data in 'features'.
json_df = json_normalize(d, 'features')
df = pd.DataFrame(json_df)
I have successfully done this on small samples of the dataset, but I can't import the whole dataset at once, even after leaving it running for 9 hours. Even though no error is raised, I assume it is a memory issue, despite my PC having 16 GB of RAM.
This is a small sample of the JSON data I am working with:
{
"type": "FeatureCollection",
"crs": {
"type": "name",
"properties": {
"name": "EPSG:27700"
}
},
"features": [
{
"type": "Feature",
"id": 1,
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
289344.50009999985,
60397.26009999961
],
[
289347.2400000002,
60400
]
]
]
},
"properties": {
"OBJECTID": 1,
"prob_4band": "Low",
"suitability": "National to County",
"pub_date": 1522195200000,
"shape_Length": 112.16436096255808,
"shape_Area": 353.4856092588217
}
},
{
"type": "Feature",
"id": 2,
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
289250,
60550
],
[
289200,
60550
]
]
]
},
"properties": {
"OBJECTID": 2,
"prob_4band": "Very Low",
"suitability": "National to County",
"pub_date": 1522195200000,
"shape_Length": 985.6295076665662,
"shape_Area": 18755.1377842949
}
},
I have considered splitting the JSON file into smaller chunks, but my attempts have not been successful. With the code below I get the error
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1).
import json
import os

with open(os.path.join('E:/Jupyter', 'Risk_of_Flooding_from_Rivers_and_Sea.json'), 'r',
          encoding='utf-8') as f1:
    # Assumes one JSON object per line, which this file does not have.
    ll = [json.loads(line.strip()) for line in f1.readlines()]
    print(len(ll))

size_of_the_split = 10000
total = len(ll) // size_of_the_split
print(total + 1)

for i in range(total + 1):
    with open('E:/Jupyter/split' + str(i + 1) + '.json', 'w',
              encoding='utf-8') as f2:
        json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split],
                  f2, ensure_ascii=False, indent=True)
I'm just wondering what my options are. Is the way I'm doing it the best way, and if so, what can I change? I am getting the smaller samples from this source, but they can't be too big.
To split the data you can use a streaming parser such as ijson. (Your line-by-line attempt fails because the file is a single pretty-printed JSON document, not one JSON object per line, so json.loads chokes on the lone { of the first line.) For example:
import ijson
import itertools
import json

chunk_size = 10_000
filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

with open(filename, mode='rb') as file_in:
    # Stream each element of the top-level 'features' array without
    # loading the whole document into memory.
    features = ijson.items(file_in, 'features.item', use_float=True)
    chunk = list(itertools.islice(features, chunk_size))
    count = 1
    while chunk:
        with open(f'features-split-{count}.json', mode='w') as file_out:
            json.dump(chunk, file_out, ensure_ascii=False, indent=4)
        chunk = list(itertools.islice(features, chunk_size))
        count += 1
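Each split file is then small enough to load with your original json_normalize approach, e.g.:

import json
import pandas as pd

with open('features-split-1.json') as f:
    chunk_df = pd.json_normalize(json.load(f))
chunk_df.to_csv('features-split-1.csv', index=False)

Alternatively, since the end goal is CSV, you could skip the intermediate JSON files and stream each feature's properties straight into a CSV writer. A minimal sketch, assuming the hypothetical output name flood_risk.csv and that you only need the properties columns shown in your sample:

import csv
import ijson

filename = 'Risk_of_Flooding_from_Rivers_and_Sea.json'

# Columns taken from the sample's 'properties'; adjust to what you need.
fieldnames = ['OBJECTID', 'prob_4band', 'suitability',
              'pub_date', 'shape_Length', 'shape_Area']

with open(filename, mode='rb') as file_in, \
        open('flood_risk.csv', mode='w', newline='') as file_out:
    writer = csv.DictWriter(file_out, fieldnames=fieldnames)
    writer.writeheader()
    # Stream one feature at a time, so memory use stays flat.
    for feature in ijson.items(file_in, 'features.item', use_float=True):
        writer.writerow(feature['properties'])

Only one feature is held in memory at a time, so this should stay comfortably within 16 GB of RAM.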