How to read a chunk from the middle of a long CSV file using Python (200 GB+)
I have a very large csv file that I am reading in chunks. Partway through, the process ran out of memory, so I want to restart from where it left off. I know which chunk it stopped at, but I don't know how to jump straight to that chunk.

Here is what I tried:
import os
import pandas as pd

# data is the txt file
reader = pd.read_csv(data,
                     delimiter="\t",
                     chunksize=1000)

# When my last process broke, i was 154, so I think it should
# start from the 154000th line. This time I don't plan to read
# the whole file at once, so I have an end point at 160000.
first = 154 * 1000
last = 160 * 1000

output_path = 'usa_hotspot_data_' + str(first) + '_' + str(last) + '.csv'
print("Output file: ", output_path)

# Remove any stale output from the previous run
try:
    os.remove(output_path)
except OSError:
    pass

# Read chunks and save to a new csv
for i, chunk in enumerate(reader):
    if first <= i <= last:
        ...  # <-- here I do something -->
    # Progress bar to keep track
    if i % 1000 == 0:
        print("#", end='')
However, this takes a long time just to reach the i-th chunk I want. How can I skip the chunks before it and go there directly?
skiprows: Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

You can pass this skiprows to read_csv; it acts like an offset into the file.
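A minimal sketch of how that could look here, assuming data and output_path are the same variables as in the question. Passing a range instead of an int keeps line 0, so the header is still parsed:

import pandas as pd

first = 154 * 1000   # first data row to resume from
last = 160 * 1000    # end point from the question

# skiprows=range(1, first + 1) skips data rows 1..154000 but keeps
# the header line; skiprows=first would skip the header as well.
reader = pd.read_csv(data,
                     delimiter="\t",
                     skiprows=range(1, first + 1),
                     chunksize=1000)

rows_left = last - first
for chunk in reader:
    # Append each chunk to the output, capping at the end point
    chunk.head(rows_left).to_csv(output_path, mode="a",
                                 header=False, index=False)
    rows_left -= len(chunk)
    if rows_left <= 0:
        break

Note that skiprows still has to scan the skipped lines to find the newlines (line lengths vary, so there is no way to seek directly), but skipping is much cheaper than parsing every row into a DataFrame, so this should reach the 154000th line far faster than iterating over the earlier chunks.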