How to read a chunk from the middle of a long CSV file using Python (200 GB+)
I have a very large csv file that I am reading in chunks. Partway through, the process ran out of memory, so I want to restart from where it left off. I know which chunk it stopped at, but I don't know how to jump straight to that chunk.

Here is what I tried:
import os
import pandas as pd

# data is the txt file
reader = pd.read_csv(data,
                     delimiter="\t",
                     chunksize=1000)

# When my last process broke, i was 154, so I think it should
# start from the 154000th line. This time I don't plan to read
# the whole file at once, so I have an end point at 160000.
first = 154 * 1000
last = 160 * 1000

output_path = 'usa_hotspot_data_' + str(first) + '_' + str(last) + '.csv'
print("Output file: ", output_path)

# Remove any stale output from the previous run
try:
    os.remove(output_path)
except OSError:
    pass

# Read chunks and save to a new csv
for i, chunk in enumerate(reader):
    if first <= i <= last:
        ...  # <-- here I do something -->
    # Progress bar to keep track
    if i % 1000 == 0:
        print("#", end='')
However, this takes a long time just to reach the i-th chunk I want. How can I skip the chunks before it and go there directly?
skiprows: Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

You can pass this skiprows to read_csv; it acts like an offset into the file.
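A minimal sketch of how that could look here, assuming data and output_path are the same variables as in the question. Passing a range instead of an int keeps line 0, so the header is still parsed:

import pandas as pd

first = 154 * 1000   # first data row to resume from
last = 160 * 1000    # end point from the question

# skiprows=range(1, first + 1) skips data rows 1..154000 but keeps
# the header line; skiprows=first would skip the header as well.
reader = pd.read_csv(data,
                     delimiter="\t",
                     skiprows=range(1, first + 1),
                     chunksize=1000)

rows_left = last - first
for chunk in reader:
    # Append each chunk to the output, capping at the end point
    chunk.head(rows_left).to_csv(output_path, mode="a",
                                 header=False, index=False)
    rows_left -= len(chunk)
    if rows_left <= 0:
        break

Note that skiprows still has to scan the skipped lines to find the newlines (line lengths vary, so there is no way to seek directly), but skipping is much cheaper than parsing every row into a DataFrame, so this should reach the 154000th line far faster than iterating over the earlier chunks.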