Slicing a large file, removing duplicates and merging into output using Pandas
So, I have a geopackage with 1.25 billion features. The file contains no geometry at all, just a single attribute, 'id', which is supposed to be a unique identifier. There are many duplicates, and I want to drop the duplicate 'id' values and keep only the unique ones. Because there is so much data (the geopackage is 19 GB), I went with slicing. I tried multiprocessing, but that didn't work out: it gets awkward because I have to keep track of the unique 'id' values across workers, and multiprocessing doesn't allow that (at least as far as I know).
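The bookkeeping described here, accumulating unique ids chunk by chunk in a single process, can also be sketched with a plain Python set, which deduplicates on insert instead of repeatedly calling drop_duplicates on a growing DataFrame. A minimal sketch, assuming the path and 'id' column from this post (the stop condition anticipates the fix in the final script below):

import geopandas as gpd

fname = "path/Output.gpkg"
slice_count = 200
unique_ids = set()  # a set keeps exactly one copy of each id on insert
start = 0
while True:
    chunk = gpd.read_file(fname, rows=slice(start, start + slice_count),
                          ignore_geometry=True)
    unique_ids.update(chunk['id'])
    # A chunk shorter than slice_count means the file is exhausted; waiting
    # for an empty chunk does not work here (see the wrap-around issue below).
    if len(chunk) < slice_count:
        break
    start += slice_count
print(len(unique_ids))

With 1.25 billion ids even a set may not fit in RAM, so this illustrates the idea rather than guaranteeing the memory footprint.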
Here is what I have:
import fiona
import geopandas as gpd
import pandas as pd
# import numpy as np

slice_count = 200
start = 0
end = slice_count

fname = "path/Output.gpkg"
file_gpd = gpd.read_file(fname, rows=slice(start, end))
chunk = pd.DataFrame(file_gpd)
chunks = pd.DataFrame()
only_ids = pd.DataFrame(columns=['id'])

loop = True
while loop:
    try:
        # Drop duplicates in the current chunk
        chunk = chunk.drop_duplicates(subset=['id'])
        # Extract only the unique ids from the chunk to save memory
        only_ids_in_chunk = pd.DataFrame()
        only_ids_in_chunk['id'] = chunk['id']
        only_ids = only_ids.append(only_ids_in_chunk)
        only_ids = only_ids.drop_duplicates(subset=['id'])
        # If we want to produce another file in which all values are unique,
        # we must store the contents of the chunk variable somewhere before
        # loading the next chunk, because we must not hold all chunks in
        # memory at the same time.
        del chunk
        # Load the next chunk
        start += slice_count
        end += slice_count
        file_gpd = gpd.read_file(fname, rows=slice(start, end))
        chunk = pd.DataFrame(file_gpd)
        if len(chunk) == 0:
            print(len(only_ids))
            loop = False
    except Exception:
        loop = False
        print("Iteration is stopped")
I am getting an infinite loop. I thought the if statement would catch the moment the chunk length drops to 0, i.e. when the slices run out.
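A quick way to see what a past-the-end read actually returns is to request a slice that starts beyond the feature count and check how many rows come back. A minimal probe, using the approximate count quoted above:

import geopandas as gpd

fname = "path/Output.gpkg"       # same file as above
total_features = 1_250_000_000   # approximate feature count quoted above
probe = gpd.read_file(fname, rows=slice(total_features, total_features + 10))
# If the read wraps around to the start of the file, this prints 10,
# never the 0 that the loop's exit condition is waiting for.
print(len(probe))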
So, here is the final script. The problem I ran into is that when you slice a geopackage file with geopandas, reading does not stop once you reach the end of the file: it wraps around and starts again from the beginning. That is why I added the if statement at the end of the loop body, to catch the last, shorter chunk and break out.
import fiona
import geopandas as gpd
import pandas as pd
import logging
import time

# Configure logging once, outside the loop
FORMAT = '%(asctime)s:%(name)s:%(levelname)s - %(message)s'
logging.basicConfig(format=FORMAT, level=logging.INFO)

slice_count = 20000000
start = 0
end = slice_count

fname = "/Output.gpkg"
chunk = gpd.read_file(fname, rows=slice(start, end), ignore_geometry=True)
chunks = pd.DataFrame()
only_ids = pd.DataFrame(columns=['id'])

loop = True
chunk_num = 1
while loop:
    start_time = time.time()
    # Drop duplicates in the current chunk
    chunk = chunk.drop_duplicates(subset=['id'])
    only_ids = only_ids.append(chunk)
    only_ids = only_ids.drop_duplicates(subset=['id'])
    # Delete the chunk to save memory
    del chunk
    # Load the next chunk
    start += slice_count
    end += slice_count
    chunk = gpd.read_file(fname, rows=slice(start, end), ignore_geometry=True)
    logging.info(f"Chunk {chunk_num} done")
    print(f"Duration: {time.time() - start_time}")
    chunk_num += 1
    # A chunk shorter than slice_count means we have reached the end of the
    # file; process it and stop before the reader wraps around to the start.
    if len(chunk) != slice_count:
        chunk = chunk.drop_duplicates(subset=['id'])
        only_ids = only_ids.append(chunk)
        only_ids = only_ids.drop_duplicates(subset=['id'])
        del chunk
        break

only_ids.to_csv('output.csv')
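One caveat about the script above: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current pandas the accumulation lines need pd.concat instead. A drop-in equivalent, shown on toy data:

import pandas as pd

only_ids = pd.DataFrame({'id': [1, 2, 3]})
chunk = pd.DataFrame({'id': [3, 4]})
# pandas >= 2.0 replacement for `only_ids = only_ids.append(chunk)`:
only_ids = pd.concat([only_ids, chunk], ignore_index=True)
only_ids = only_ids.drop_duplicates(subset=['id'])
print(only_ids)  # rows with ids 1, 2, 3, 4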