Should I use Python pandas or write custom code in C for reading and filtering data from a multi-gigabyte CSV file?

Question

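I have a 13 GB CSV file that I need to read and filter. I am using pandas and reading the file in chunks, but it takes too long. Is there another Python library that is faster than pandas, or would writing custom code in C be a better option?

I am using the following code: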

import pandas as pd

input_df = pd.read_csv("input file", chunksize=60000)
frames = []
for chunk in input_df:
    # keep rows where any of the three columns contains the search string
    mask = (
        chunk["Column1"].str.contains("given string")
        | chunk["column2"].str.contains("given string")
        | chunk["column3"].str.contains("given string")
    )
    frames.append(chunk[mask])
output_df = pd.concat(frames)
output_df.to_csv('output.csv', index=False)

I have 8 GB of RAM, so I have to read the data in chunks.

Answer 1

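Pandas and NumPy are built on C, so I don't see how you would get better speed by writing the code in pure C yourself; poorly written C could even make things worse.
Try to focus on improving your algorithm or the way you currently read the file.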
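
As one possible way to improve the way the file is read while staying in pandas, here is a minimal sketch. It assumes the search term is a literal substring (so `regex=False` is safe), reads every column as text, and appends each filtered chunk straight to the output file instead of collecting all frames in memory. The function name `filter_csv_chunks`, its parameters, and the default chunk size are hypothetical and only for illustration.

```python
import pandas as pd


def filter_csv_chunks(src, dst, columns, needle, chunksize=500_000):
    """Stream `src` in chunks; write rows where any of `columns` contains `needle` to `dst`."""
    first = True
    # dtype=str skips per-column type inference; missing values stay NaN
    for chunk in pd.read_csv(src, chunksize=chunksize, dtype=str):
        mask = pd.Series(False, index=chunk.index)
        for col in columns:
            # regex=False does a plain substring search; na=False treats NaN as "no match"
            mask |= chunk[col].str.contains(needle, regex=False, na=False)
        # the first chunk creates the file and writes the header; later chunks append
        chunk[mask].to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```

With the names from the question this could be called as `filter_csv_chunks("input file", "output.csv", ["Column1", "column2", "column3"], "given string")`. Whether it is actually faster depends on the data, so it is worth timing on a sample of the file first.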

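If all you want to do is read the CSV file and filter rows based on whether they contain a particular string, then I think reading it line by line is the better approach.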

import pandas as pd

# store your results here
result = {"col1": [], "col2": [], "col3": []}
to_check = "some string"
reset_after = 1000  # flush the buffered matches after this many input lines
current_line = 0

with open('filename.csv', 'r') as fp:
    for line in fp:
        current_line += 1

        # a plain split assumes exactly three columns and no quoted commas
        val1, val2, val3 = line.rstrip("\n").split(",")
        if (to_check in val1) or (to_check in val2) or (to_check in val3):
            result['col1'].append(val1)
            result['col2'].append(val2)
            result['col3'].append(val3)

        # every reset_after lines, append the buffered matches to the output file and reset
        if current_line % reset_after == 0:
            df = pd.DataFrame(result)
            df.to_csv("result_file.csv", mode='a', index=False, header=False)
            result = {"col1": [], "col2": [], "col3": []}

# save whatever is still buffered after the last line
df = pd.DataFrame(result)
df.to_csv("result_file.csv", mode='a', index=False, header=False)
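
One caveat with the line-by-line idea: `line.split(",")` breaks as soon as a field contains a quoted comma. A minimal sketch of the same approach using the standard library `csv` module is shown below. It assumes the file has a header row and is UTF-8 encoded; the function name `filter_rows` and its parameters are hypothetical.

```python
import csv


def filter_rows(src, dst, columns, needle):
    """Copy rows from `src` to `dst` when any of `columns` contains `needle`; return the match count."""
    kept = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty input file
            return 0
        writer.writerow(header)
        # resolve column names to positions once, outside the loop
        idx = [header.index(c) for c in columns]
        for row in reader:
            # the length check skips missing fields in short or malformed rows
            if any(i < len(row) and needle in row[i] for i in idx):
                writer.writerow(row)
                kept += 1
    return kept
```

Only one row is held in memory at a time, so the 8 GB limit is not an issue, and matching rows are written out as they are found.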

c

python

csv

optimization

pandas