How can I read tsv files and store them as hdf5 without running out of memory?
I have several datasets larger than 10 GB (in TSV format) that I need in HDF5 format. I am working in Python. I have read that the Pandas package can read files and store them as HDF5 without using much memory. However, my machine runs out of memory, so I can't do it that way. I have also tried Spark, but I don't feel comfortable with it. So, what alternative solutions do I have, other than reading the whole large file into memory?
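The trick is to never materialize the whole file: pandas' read_csv accepts a chunksize argument that turns it into a lazy iterator, and HDFStore.append grows an on-disk table one chunk at a time, so peak memory stays at roughly one chunk. The self-contained demo below generates a small TSV in memory and walks through the whole pipeline: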
import pandas as pd
import numpy as np
# written for Python 3.4
# on Python 2.x, use StringIO.StringIO (from the StringIO module) instead of io.StringIO
import io
# generate some 'large' tsv
raw_data = pd.DataFrame(np.random.randn(10000, 5), columns='A B C D E'.split())
raw_tsv = raw_data.to_csv(sep='\t')
# read the TSV in chunks of 50 rows (tune the chunk size to your machine's memory)
# StringIO just provides an in-memory buffer for this demo; when reading
# an external file, pass the file path instead
file_reader = pd.read_csv(filepath_or_buffer=io.StringIO(raw_tsv), sep='\t', chunksize=50)
# to inspect a chunk, you could run: list(file_reader)[0]
# each chunk is a DataFrame of exactly 50 rows
# don't do this in your real processing: file_reader is a lazy iterator
# and can only be consumed once
# (a safe way to peek is sketched after the sample output below)
Unnamed: 0 A B C D E
0 0 -1.2553 0.1386 0.6201 0.1014 -0.4067
1 1 -1.0127 -0.8122 -0.0850 -0.1887 -0.9169
2 2 0.5512 0.7816 0.0729 -1.1310 -0.8213
3 3 0.1159 1.1608 -0.4519 -2.1344 0.1520
4 4 -0.5375 -0.6034 0.7518 -0.8381 0.3100
5 5 0.5895 0.5698 -0.9438 3.4536 0.5415
6 6 -1.2809 0.5412 0.5298 -0.8242 1.8116
7 7 0.7242 -1.6750 1.0408 -0.1195 0.6617
8 8 -1.4313 -0.4498 -1.6069 -0.7309 -1.1688
9 9 -0.3073 0.3158 0.6478 -0.6361 -0.7203
.. ... ... ... ... ... ...
40 40 -0.3143 -1.9459 0.0877 -0.0310 -2.3967
41 41 -0.8487 0.1104 1.2564 1.0890 0.6501
42 42 1.6665 -0.0094 -0.0889 1.3877 0.7752
43 43 0.9872 -1.5167 0.0059 0.4917 1.8728
44 44 0.4096 -1.2913 1.7731 0.3443 1.0094
45 45 -0.2633 1.8474 -1.0781 -1.4475 -0.2212
46 46 -0.2872 -0.0600 0.0958 -0.2526 0.1531
47 47 -0.7517 -0.1358 -0.5520 -1.0533 -1.0962
48 48 0.8421 -0.8751 0.5380 0.7147 1.0812
49 49 -0.8216 1.0702 0.8911 0.5189 -0.1725
[50 rows x 6 columns]
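If you just want to peek at a chunk without exhausting file_reader, a minimal sketch (reusing the raw_tsv string from above) is to build a separate, throwaway reader:

preview_reader = pd.read_csv(io.StringIO(raw_tsv), sep='\t', chunksize=50)
first_chunk = next(preview_reader)   # consumes only the throwaway reader
print(first_chunk.shape)             # (50, 6): 50 rows, 6 columns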
# open the HDF5 store with the highest compression level (9) and the blosc compressor
h5_file = pd.HDFStore('your_hdf5_file.h5', complevel=9, complib='blosc')
h5_file
Out[18]:
<class 'pandas.io.pytables.HDFStore'>
File path: your_hdf5_file.h5
Empty
# now stream the chunks into the store
for df_chunk in file_reader:
    # append() grows the on-disk table; only the current chunk is ever in memory
    h5_file.append('big_data', df_chunk, complevel=9, complib='blosc')
# when done, close the HDF5 file
h5_file.close()
# reopen and inspect the HDF5 file:
pd.HDFStore('your_hdf5_file.h5')
# it now holds all 10,000 rows, written chunk by chunk
Out[21]:
<class 'pandas.io.pytables.HDFStore'>
File path: your_hdf5_file.h5
/big_data frame_table (typ->appendable,nrows->10000,ncols->6,indexers->[index])
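The payoff works in the other direction, too: you can stream the table back out without ever loading it whole. A minimal sketch using HDFStore.select with a chunksize (and a context manager so the file is closed automatically):

with pd.HDFStore('your_hdf5_file.h5', mode='r') as store:
    for chunk in store.select('big_data', chunksize=1000):
        # each chunk is a 1000-row DataFrame; process it here,
        # e.g. accumulate per-column sums for an out-of-core mean
        pass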