Peak memory usage much larger when using pandas read_csv with StringIO instead of a file object

Question

I have a 600 MB CSV that I load with pandas' read_csv using one of the following two methods.

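import io

import pandas as pd

def read_my_csv1():
    # Method 1: let pandas open and read the file itself.
    df = pd.read_csv('my_data.csv')
    print(len(df))

def read_my_csv2():
    # Method 2: read the whole file into a str, then parse it via StringIO.
    with open('my_data.csv') as f:
        file_contents = f.read()
    data_frame = pd.read_csv(io.StringIO(file_contents))
    print(len(data_frame))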

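The first method gives a peak memory usage of 1 GB.

The second method gives a peak memory usage of 4 GB.

I measured peak memory usage with fil-profile.

How can the difference be so large? Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?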

Answer 1

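It looks like StringIO maintains its own copy of the string data, so at least temporarily you have three copies of your data in memory: one in file_contents, one in the StringIO object, and one in the final DataFrame. Meanwhile, read_csv can, at least in theory, read the input file line by line, so when reading directly from a file there is only one full copy of the data, the one in the final DataFrame.

You could try deleting file_contents after creating the StringIO object and see whether that improves things.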
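
A minimal sketch of that suggestion, assuming the same setup as in the question. The function name read_csv_via_stringio and its path parameter are made up for illustration, and I have not profiled how much this actually saves:

```python
import io

import pandas as pd


def read_csv_via_stringio(path):
    # Read the whole file into a str (first copy of the data).
    with open(path) as f:
        file_contents = f.read()
    # StringIO copies the text into its own internal buffer (second copy).
    buffer = io.StringIO(file_contents)
    # Drop the only reference to the original str so it can be freed
    # before pandas starts building the DataFrame.
    del file_contents
    return pd.read_csv(buffer)
```

This only removes the file_contents copy; the StringIO buffer itself stays alive until read_csv is done with it.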

Answer 2

How can the difference be so large?

StringIO uses a buffer of type Py_UCS4 [source]. That is a 32-bit data type, while the CSV file is probably ASCII or UTF-8. So the buffer is 4 times the size of the text, an overhead of factor 3 that accounts for an additional ~1.8 GB. Also, the StringIO buffer may overallocate by 12.5% [source].
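
If you want to check the 4-bytes-per-character claim yourself, here is a rough sketch using tracemalloc. It assumes CPython (where the StringIO buffer is allocated through an allocator that tracemalloc tracks) and uses a small 1-million-character ASCII string instead of the real file; the variable names are my own:

```python
import io
import tracemalloc

n_chars = 1_000_000
# ASCII-only text: CPython stores this str at 1 byte per character.
text = "a" * n_chars

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
buffer = io.StringIO(text)  # copies the text into StringIO's internal buffer
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Memory newly allocated for the StringIO object, per input character.
bytes_per_char = (after - before) / n_chars
print(f"StringIO holds roughly {bytes_per_char:.1f} bytes per character")
```

On other Python implementations the internal representation may differ, so treat the number as a property of CPython rather than of the io module in general.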

Best case:

file_contents    600 MB
io.StringIO     2400 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3600 MB (at least)

Case with 12.5% overallocation:

file_contents    600 MB
io.StringIO     2700 MB
data_frame       600 MB (at least)
DLLs, EXEs, ...    ? MB
-----------------------
                3900 MB (at least)
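
The arithmetic behind the two tables can be reproduced as follows. This is only a sketch of the estimate, assuming the CSV is 1 byte per character and that the DataFrame takes at least as much memory as the file; the variable names are mine:

```python
CSV_MB = 600          # size of the CSV text, assumed 1 byte per character
UCS4_BYTES = 4        # Py_UCS4 is a 4-byte (32-bit) type
DATA_FRAME_MB = 600   # lower bound assumed for the resulting DataFrame

# StringIO buffer without and with 12.5% (1/8) overallocation.
stringio_best = CSV_MB * UCS4_BYTES
stringio_over = stringio_best + stringio_best // 8

# file_contents + StringIO buffer + DataFrame.
total_best = CSV_MB + stringio_best + DATA_FRAME_MB
total_over = CSV_MB + stringio_over + DATA_FRAME_MB

print(stringio_best, total_best)  # 2400 3600
print(stringio_over, total_over)  # 2700 3900
```

Both totals are in the same range as the 4 GB peak reported in the question.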

Is there a way to load a CSV from a string that doesn't make peak memory usage go through the roof?

del the temporary objects.
Don't use StringIO.
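
One possible way to avoid StringIO when the data is already in memory is to keep it as bytes and wrap it in io.BytesIO, which read_csv also accepts as a file-like object. This is a sketch of an alternative, not something I have profiled against the numbers above; the function name read_csv_from_bytes is made up for illustration:

```python
import io

import pandas as pd


def read_csv_from_bytes(raw: bytes) -> pd.DataFrame:
    # BytesIO keeps the data as raw bytes (1 byte per byte of the file)
    # instead of expanding it to 4 bytes per character like StringIO.
    return pd.read_csv(io.BytesIO(raw))
```

For example, open the file with open('my_data.csv', 'rb'), read it into raw, and pass raw to this function. If the data only exists as a str, encoding it to bytes creates another temporary copy, so del the str as soon as it is no longer needed.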
