How to import a gzip file larger than the RAM limit into a Pandas DataFrame? "Kill 9": use HDF5?
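I have a gzip file that is approximately 90 GB. This is well within disk space, but far larger than RAM.
How can I import it into a pandas DataFrame? I tried the following at the command line: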
# start with Python 3.4.5
import pandas as pd
filename = 'filename.gzip'  # size 90 GB
df = pd.read_table(filename, compression='gzip')
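However, after a few minutes, Python shuts down with Kill 9.
After defining the database object df, I plan to save it to HDF5.
What is the correct way to do this? How can I use pandas.read_table() to do it?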
I'd do it this way:
import pandas as pd

filename = 'filename.gzip'  # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA', 'colB', 'colC', 'colZ']  # put here a list of all your columns
cols_to_index = ['colA', 'colZ']  # put here the list of YOUR columns that you want to index
chunksize = 10**6  # you may want to adjust it ...

store = pd.HDFStore(hdf_fn)

# read the file in pieces of `chunksize` rows so it never has to fit in RAM at once
for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
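
To see why the chunksize argument matters, here is a minimal sketch of the chunked-read pattern on its own, without the HDF5 step. It assumes a tab-separated file; the tiny sample file, the two column names, and the chunk size of 300 are made up for illustration and stand in for the real 90 GB file. With chunksize set, pd.read_table returns an iterator of DataFrames instead of one big DataFrame, so only one chunk is held in memory at a time:

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small tab-separated gzip file as a stand-in for the real 90 GB one.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample_path, "wt") as fh:
    for i in range(1000):
        fh.write(f"{i}\t{i * 2}\n")

cols = ["colA", "colB"]  # hypothetical column names for the sample file

# With chunksize set, read_table yields DataFrames of at most 300 rows each.
reader = pd.read_table(
    sample_path,
    compression="gzip",
    header=None,
    names=cols,
    chunksize=300,
)

n_rows = 0
n_chunks = 0
total_b = 0
for chunk in reader:
    # Each chunk is an ordinary DataFrame; aggregate it, then let it go.
    n_chunks += 1
    n_rows += len(chunk)
    total_b += int(chunk["colB"].sum())

print(n_chunks, n_rows, total_b)
```

The loop in the answer above follows the same shape, except that each chunk is appended to the HDFStore instead of being aggregated.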