Efficient merge for many huge csv files

Question

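I have a script that takes all the csv files in a directory and merges them side by side, using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I actually need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear on whether or how it could be applied to this situation. The other alternative I can think of is to install mySQL, learn the basics, and do this there. Obviously I'd rather do this in Python if possible, because I'm already learning it. An R-based solution would also be acceptable.

Here is my code: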

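import os
import glob
import pandas as pd

# Raw string so the backslashes are not read as escape sequences (e.g. "\f").
os.chdir(r"\path\containing\files")

files = glob.glob("*.csv")
# Start from the first file, then outer-join every other file onto it.
sdf = pd.read_csv(files[0], sep=',')

for filename in files[1:]:
    # Each file is loaded in full, so memory use grows with every merge.
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])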

Any advice on how to do this with files too big for my computer's memory would be greatly appreciated.

Answer 1

dask may well be a good fit for your use case. It might depend on what you want to do after the merge.

Answer 2

You should be able to do this with Python, but I don't think reading the csv files all at once is the most efficient use of your memory.

How to read a CSV file from a stream and process each line as it is written?
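
To illustrate the line-by-line idea, here is a minimal sketch using only the standard library. It rests on assumptions that are not stated in the question: every file is already sorted by (Factor1, Factor2), compared as strings, and each key pair appears at most once per file. If the files are not sorted, they would need an on-disk sort first. The function names and the `_0`, `_1`, ... column suffixes are made up for the example.

```python
import csv
import heapq
from itertools import groupby

KEYS = ["Factor1", "Factor2"]  # join columns from the question


def keyed_rows(path, file_index):
    """Lazily yield (key, file_index, row) for each row of one CSV file."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield tuple(row[k] for k in KEYS), file_index, row


def value_columns(path):
    """Return the non-key column names of one CSV file (reads the header only)."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return [c for c in header if c not in KEYS]


def streaming_outer_merge(paths, out_path):
    """Outer-join key-sorted CSV files while holding one row per file in memory."""
    columns = [value_columns(p) for p in paths]
    streams = [keyed_rows(p, i) for i, p in enumerate(paths)]
    # heapq.merge interleaves the already-sorted streams without loading them.
    merged = heapq.merge(*streams, key=lambda item: (item[0], item[1]))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Suffix value columns with the file index so names cannot collide.
        writer.writerow(
            KEYS + [f"{c}_{i}" for i, cols in enumerate(columns) for c in cols]
        )
        # Rows sharing a key arrive consecutively, so groupby collects them.
        for key, group in groupby(merged, key=lambda item: item[0]):
            by_file = {i: row for _, i, row in group}
            line = list(key)
            for i, cols in enumerate(columns):
                row = by_file.get(i)
                # A file without this key contributes empty cells (outer join).
                line.extend(row[c] if row else "" for c in cols)
            writer.writerow(line)
```

Calling `streaming_outer_merge(sorted(glob.glob("*.csv")), "merged.csv")` would write the joined result straight to disk. Memory use depends on the number of files, not on their size, which is where itertools and heapq help here.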

Answer 3

Use HDF5, which in my opinion would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.

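import os
import glob
import pandas as pd

# Raw string so the backslashes are not read as escape sequences (e.g. "\f").
os.chdir(r"\path\containing\files")

files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # complevel=5 with blosc compresses the file. You can drop that or
    # change it as per your needs.
    for filename in files:
        # Read in chunks (the size is arbitrary, tune it to your RAM) so a
        # whole 6-12 GB file is never held in memory at once. Rows from every
        # file are appended to the same on-disk table.
        for chunk in pd.read_csv(filename, sep=',', chunksize=1_000_000):
            # data_columns makes the key columns indexable and queryable.
            store.append('table_name', chunk, index=False,
                         data_columns=['Factor1', 'Factor2'])
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'], optlevel=9, kind='full')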

python

merge

itertools

large-files

pandas