How to sample a very big CSV file (6GB)
There is a very large CSV file (the first row is a header), and I want to split it into 100 samples (e.g. by line_num % 100). How can I do this efficiently under main-memory constraints?
Split the file into 100 smaller files: every row with line_num % 100 == 1 goes to sub-file 1, every row with line_num % 100 == 2 goes to sub-file 2, ..., and every row with line_num % 100 == 0 goes to file 100.
This should produce 100 files of about 600 MB each, not a sample of 100 rows or of 1/100 the size.
I tried to do it like this:
fi = [open('split_data//%d.csv' % i, 'w') for i in range(100)]
i = 0
with open('data//train.csv') as fin:
    first = fin.readline()        # skip the header
    for line in fin:
        fi[i % 100].write(line)   # route each line to one of the 100 part files
        i = i + 1
for i in range(100):
    fi[i].close()
But the file is very large and memory is limited. How should I handle this? I would like to do it in a single pass.
(My code does run; it was just so slow that I mistook it for having crashed, sorry~~)
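For reference, a minimal sketch of the same single-pass modulo split, assuming the paths from the question; the explicit buffer size passed as the third argument to open is my own addition. Memory stays bounded: only the current input line plus the write buffers of the 100 open handles are held at any time.
# Sketch only: paths match the question; the ~1 MB buffer per part file is an assumption.
out = [open('split_data//%d.csv' % i, 'w', 1 << 20) for i in range(100)]
with open('data//train.csv') as fin:
    fin.readline()                 # discard the header
    for n, line in enumerate(fin):
        out[n % 100].write(line)   # line n goes to part file n % 100
for f in out:
    f.close()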
To split the file into 100 parts as stated in the comments (I want to split the file into 100 parts in a modulus way, i.e. range(200) --> [0,100]; [1,101]; [2,102], and yes, split one big file into a hundred smaller files):
import csv

files = [open('part_{}'.format(n), 'wb') for n in xrange(100)]
csvouts = [csv.writer(f) for f in files]
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    next(csvin, None)  # Skip header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)
for f in files:
    f.close()
You can use islice with a step, instead of taking the line number modulo something, to cover the file, e.g.:
import csv
from itertools import islice

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip header, and then return every 100th row until the file ends
    for line in islice(csvin, 1, None, 100):
        pass  # do something with line
Example:
r = xrange(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
Based on @Jon Clements' answer, I would also benchmark this variant:
import csv
from itertools import islice

with open('in.csv') as fin:
    first = fin.readline()  # discard the header
    csvin = csv.reader(islice(fin, None, None, 100))  # this line is the only difference
    for row in csvin:
        print row  # do something with row
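A rough way to time the two variants, assuming the same in.csv as above (the variant_a / variant_b names are made up for illustration); the second one only runs the csv parser on the rows it keeps, which is where any speed difference would come from:
import csv
import time
from itertools import islice

def variant_a(path):
    # islice over the csv reader: every row is parsed, every 100th is kept
    with open(path) as fin:
        return sum(1 for _ in islice(csv.reader(fin), 1, None, 100))

def variant_b(path):
    # islice over the raw file: lines are skipped before parsing, only kept lines are parsed
    with open(path) as fin:
        fin.readline()  # discard the header
        return sum(1 for _ in csv.reader(islice(fin, None, None, 100)))

for fn in (variant_a, variant_b):
    t0 = time.time()
    n = fn('in.csv')
    print fn.__name__, n, time.time() - t0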
If you only need 100 samples, you can use this idea, which does just 100 reads at equally spaced positions within the file. This should work for CSV files whose line lengths are roughly uniform.
import os
import csv

def sample100(path):
    with open(path) as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline()                # skip the first line (the header)
        start = fin.tell()
        step = (end - start) / 100
        offset = start
        while offset < end:
            fin.seek(offset)
            fin.readline()            # this might not be a complete line
            if fin.tell() < end:
                yield fin.readline()  # this is a complete non-empty line
            else:
                break                 # not really necessary...
            offset = offset + step

for row in csv.reader(sample100('in.csv')):
    pass  # do something with row
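Because step is computed with integer division, the loop can occasionally probe one extra offset and yield 101 lines instead of 100. If at most 100 samples are required, one option (my addition, not part of the answer above) is to cap the generator with islice:
from itertools import islice

# Cap the generator at 100 lines before handing it to csv.reader (sketch, reuses sample100 from above).
rows = list(csv.reader(islice(sample100('in.csv'), 100)))
print len(rows)  # at most 100 sampled rows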
I think you could just open the same file 10 times and then manipulate (read) each copy independently, effectively splitting it into sub-files without actually doing so.
Unfortunately this requires knowing in advance how many rows the file contains, and determining that requires reading the whole file once to count them. On the other hand, it should be relatively fast, since no other processing takes place.
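For a file this size, that counting pass can also be done without csv parsing at all; a hedged shortcut (my addition, using the path from the question) that is only valid when no field contains an embedded newline:
# One pass, constant memory: count data lines by iterating the raw file.
with open('data//train.csv', 'rb') as f:
    f.readline()                  # skip the header
    num_rows = sum(1 for _ in f)  # number of data rows (assumes no newlines inside quoted fields)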
To illustrate and test this approach, I created a simpler (only one item per row) and much smaller csv test file that looks like this (the first line is the header row and is not counted):
line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000
Here is the code and sample output:
from collections import deque
import csv

# count number of rows in csv file
# (this requires reading the whole file)
file_name = 'mycsvfile.csv'
with open(file_name, 'rb') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)):
        pass
rows_per_section = num_rows // 10
print 'number of rows: {:,d}'.format(num_rows)
print 'rows per section: {:,d}'.format(rows_per_section)

csv_files = [open(file_name, 'rb') for _ in xrange(10)]
csv_readers = [csv.reader(f) for f in csv_files]
map(next, csv_readers)  # skip header

# position each file handle at its starting position in file
for i in xrange(10):
    for j in xrange(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass

# read rows from each of the sections
for i in xrange(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print rows  # show what was read

# clean up
for i in xrange(10):
    csv_files[i].close()
Output:
number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]