Splitting a +10GB .csv File into Equal Parts Without Reading It into Memory
I have 3 files of over 10GB each that I need to split into 6 smaller files. I would normally use something like R to load a file and partition it into smaller chunks, but the files are too large to be read into R, even with 20GB of RAM.
I'm not sure how to proceed and would appreciate any tips.
In Python, using generators/iterators, you don't have to load all the data into memory.
Read the file line by line.
The csv library gives you reader and writer classes that can do the job.
To split the file, you could write something like this:
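For example, a file object is itself a lazy iterator: each step of the loop reads one line, so memory use stays constant no matter how large the file is. A minimal sketch (the function name is mine, just for illustration):

```python
def count_lines(path):
    # The file object yields one line per iteration;
    # the whole file is never held in memory.
    count = 0
    with open(path) as f:
        for line in f:
            count += 1
    return count
```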
import csv

# your input file (10GB)
in_csvfile = open('source.csv', "r", newline='')

# reader that reads the file for you line by line
reader = csv.DictReader(in_csvfile)

# number of rows read into the current output file
num = 0

# number of the current output file
output_file_num = 1

# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')

# the writer is constructed inside the read loop,
# because the csv header must already be available
# to construct the writer object
writer = None

for row in reader:
    num += 1
    # Here you have your data line in the row variable.
    # If the writer doesn't exist yet, create one
    # and write the header row into the new file.
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        writer.writeheader()

    # Write the row through the writer (into out_csvfile, remember?)
    writer.writerow(row)

    # Once 10000 rows have been written, close the current output file
    # and create a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None
        # create the new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')
        # reset the counter
        num = 0

# Close the files
in_csvfile.close()
out_csvfile.close()
I haven't tested this; I wrote it off the top of my head, so there may be mistakes :)
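Note that the snippet above splits by a fixed 10,000-row chunk, while the question asks for a fixed number of roughly equal parts. A hedged sketch of that variant (function and file names are mine, not from the answer): it makes two passes over the file, one to count the rows and one to copy them, so memory use stays constant.

```python
import csv
import math

def split_csv(src_path, n_parts, dest_pattern="part_{}.csv"):
    """Split a CSV into n_parts files with roughly equal row counts."""
    # First pass: count data rows (header excluded).
    with open(src_path, newline='') as f:
        total = sum(1 for _ in csv.reader(f)) - 1
    rows_per_part = math.ceil(total / n_parts)

    # Second pass: copy the header plus a slice of rows into each part.
    with open(src_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        for part in range(1, n_parts + 1):
            with open(dest_pattern.format(part), "w", newline='') as out:
                writer = csv.writer(out)
                writer.writerow(header)
                for _ in range(rows_per_part):
                    try:
                        writer.writerow(next(reader))
                    except StopIteration:
                        break
```

The last part simply receives whatever rows remain, so it may be slightly smaller than the others.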