Flatten JSON with Dask DataFrames
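I am trying to flatten a JSON array object (held in memory, not read from a .json file) in a Dask DataFrame. I have a lot of data and the process keeps consuming more and more RAM, so I need a solution that runs in parallel.
This is the JSON I have: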
[
  {
    "id": "0001",
    "name": "Stiven",
    "location": [
      {
        "country": "Colombia",
        "department": "Choco",
        "city": "Quibdo"
      }, {
        "country": "Colombia",
        "department": "Antioquia",
        "city": "Medellin"
      }, {
        "country": "Colombia",
        "department": "Cundinamarca",
        "city": "Bogota"
      }
    ]
  }, {
    "id": "0002",
    "name": "Jhon Jaime",
    "location": [
      {
        "country": "Colombia",
        "department": "Valle del Cauca",
        "city": "Cali"
      }, {
        "country": "Colombia",
        "department": "Putumayo",
        "city": "Mocoa"
      }, {
        "country": "Colombia",
        "department": "Arauca",
        "city": "Arauca"
      }
    ]
  }, {
    "id": "0003",
    "name": "Francisco",
    "location": [
      {
        "country": "Colombia",
        "department": "Atlantico",
        "city": "Barranquilla"
      }, {
        "country": "Colombia",
        "department": "Bolivar",
        "city": "Cartagena"
      }, {
        "country": "Colombia",
        "department": "La Guajira",
        "city": "Riohacha"
      }
    ]
  }
]
This is my DataFrame:
index id name location
0 0001 Stiven [{'country':'Colombia', 'department': 'Choco', 'city': 'Quibdo'}, {'country':'Colombia', 'department': 'Antioquia', 'city': 'Medellin'}, {'country':'Colombia', 'department': 'Cundinamarca', 'city': 'Bogota'}]
1 0002 Jhon Jaime [{'country':'Colombia', 'department': 'Valle del Cauca', 'city': 'Cali'}, {'country':'Colombia', 'department': 'Putumayo', 'city': 'Mocoa'}, {'country':'Colombia', 'department': 'Arauca', 'city': 'Arauca'}]
2 0003 Francisco [{'country':'Colombia', 'department': 'Atlantico', 'city': 'Barranquilla'}, {'country':'Colombia', 'department': 'Bolivar', 'city': 'Cartagena'}, {'country':'Colombia', 'department': 'La Guajira', 'city': 'Riohacha'}]
I need to expand each id into one row per location, like this:
index id name country department city
0 0001 Stiven Colombia Choco Quibdo
1 0001 Stiven Colombia Antioquia Medellin
2 0001 Stiven Colombia Cundinamarca Bogota
3 0002 Jhon Jaime Colombia Valle del Cauca Cali
4 0002 Jhon Jaime Colombia Putumayo Mocoa
5 0002 Jhon Jaime Colombia Arauca Arauca
6 0003 Francisco Colombia Atlantico Barranquilla
7 0003 Francisco Colombia Bolivar Cartagena
8 0003 Francisco Colombia La Guajira Riohacha
All of the processing must run in parallel with Dask. Any recommendations?
Thanks in advance.
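I suggest first solving this problem for a plain Pandas DataFrame, wrapping that solution in a function, and then using .map_partitions to apply the function to every Pandas partition of the Dask DataFrame.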
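As a rough illustration of that suggestion, here is a minimal sketch. The helper names flatten_locations and flatten_dask are made up for this example, and it assumes every entry of the location column is a list of dicts with the keys country, department and city, as in the sample data. The Pandas step uses explode plus a DataFrame built from the resulting dicts; other approaches (for example pd.json_normalize) may suit your data better.

```python
import pandas as pd

# Keys expected inside each dict of the `location` list (taken from the sample data).
LOCATION_FIELDS = ["country", "department", "city"]


def flatten_locations(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return one row per entry of the `location` list (hypothetical helper)."""
    # explode() emits one row per list element; an empty list becomes NaN and is dropped.
    exploded = pdf.explode("location", ignore_index=True).dropna(subset=["location"])
    # Expand each location dict into its own columns, aligned on the exploded index.
    details = pd.DataFrame(
        exploded["location"].tolist(),
        index=exploded.index,
        columns=LOCATION_FIELDS,
    )
    return pd.concat([exploded[["id", "name"]], details], axis=1)


def flatten_dask(ddf):
    """Apply flatten_locations to every Pandas partition of a Dask DataFrame."""
    # meta declares the output columns so Dask does not have to infer them.
    # The object dtype is an assumption; adjust it to match your real columns.
    meta = {col: "object" for col in ["id", "name", *LOCATION_FIELDS]}
    return ddf.map_partitions(flatten_locations, meta=meta)


if __name__ == "__main__":
    # Imported here so the Pandas helper can be tried without Dask installed.
    import dask.dataframe as dd

    pdf = pd.DataFrame(
        {
            "id": ["0001", "0002"],
            "name": ["Stiven", "Jhon Jaime"],
            "location": [
                [
                    {"country": "Colombia", "department": "Choco", "city": "Quibdo"},
                    {"country": "Colombia", "department": "Antioquia", "city": "Medellin"},
                ],
                [
                    {"country": "Colombia", "department": "Valle del Cauca", "city": "Cali"},
                ],
            ],
        }
    )
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(flatten_dask(ddf).compute())
```

Note that each partition is flattened independently, so the index restarts within every partition. If you need a single running index like the one in the desired output, it has to be rebuilt afterwards (for example after calling compute()).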