Splitting a +10GB .csv file into equal parts without reading it into memory

I have 3 files, each over 10GB, that I need to split into 6 smaller files. I would normally use something like R to load the files and break them into smaller chunks, but their size prevents them from being read into R, even with 20GB of RAM.

I'm not sure how to proceed, so any tips would be greatly appreciated.

In Python, using generators/iterators, you shouldn't have to load all of the data into memory.

Read the file line by line.
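
For example, just looping over a file object already streams it line by line. Here is a quick sketch (assuming the file is called source.csv, as in the code further down) that counts rows without ever holding the file in memory:

# rough sketch: a file object is a lazy iterator, so this loop reads
# one line at a time and never keeps the whole file in memory
line_count = 0
with open('source.csv', 'r') as f:
    for line in f:      # each step pulls only the next line from disk
        line_count += 1
print(line_count)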

The csv library gives you reader and writer classes that will do the job.
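
As a tiny sketch of those two classes (the file names are just placeholders): rows flow from the reader to the writer one at a time.

import csv

# stream rows from one csv file to another, one row at a time
with open('source.csv', 'r', newline="") as src, \
        open('copy.csv', 'w', newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):    # each row is a list of strings
        writer.writerow(row)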

To split the file, you could write something like this:

import csv

# your input file (10GB); newline="" is how the csv docs say to open files
in_csvfile = open('source.csv', "r", newline="")

# reader that will read the file for you, line by line
reader = csv.DictReader(in_csvfile)

# number of current line read
num = 0

# number of output file
output_file_num = 1

# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline="")

# the writer is constructed inside the read loop,
# because the csv header (the field names) has to be
# read before the writer object can be created
writer = None

for row in reader:
    num += 1

    # Here you have your data line in a row variable

    # If the writer doesn't exist yet, create one using the current row's keys
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        # write the header row so each output file is a valid CSV on its own
        writer.writeheader()

    # Write a row into a writer (out_csvfile, remember?)
    writer.writerow(row)

    # Once 10000 rows have been written to the current output file,
    # close it and start a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None

        # create new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline="")

        # reset counter
        num = 0 

# Closing the files
in_csvfile.close()
out_csvfile.close()

I haven't tested it; I wrote it off the top of my head, so there may be mistakes :)
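
If you really need exactly 6 roughly equal parts instead of fixed 10000-row chunks, one option (again an untested sketch; the part_{}.csv names are just placeholders) is a cheap first pass that only counts the rows, then the same kind of loop with a computed chunk size. Both passes still stream line by line, so memory stays flat:

import csv

num_parts = 6

# first pass: count data rows (minus the header line) without storing them
with open('source.csv', 'r', newline="") as f:
    total_rows = sum(1 for _ in f) - 1

# ceiling division, so the earlier files absorb any remainder
rows_per_file = -(-total_rows // num_parts)

with open('source.csv', 'r', newline="") as in_csvfile:
    reader = csv.DictReader(in_csvfile)
    writer = None
    num = 0
    output_file_num = 1
    out_csvfile = open('part_{}.csv'.format(output_file_num), 'w', newline="")

    for row in reader:
        if writer is None:
            writer = csv.DictWriter(out_csvfile, fieldnames=row.keys())
            writer.writeheader()
        writer.writerow(row)
        num += 1

        # roll over to the next part, but never create more than num_parts files
        if num >= rows_per_file and output_file_num < num_parts:
            out_csvfile.close()
            output_file_num += 1
            out_csvfile = open('part_{}.csv'.format(output_file_num), 'w', newline="")
            writer = None
            num = 0

    out_csvfile.close()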