Why does Pandas skip first set of chunks when iterating over csv in my code

Question

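I have a very large CSV file, which I read iteratively using pandas' chunking feature. The problem: with chunksize=2, for example, it skips the first 2 rows, and the first chunk I receive is rows 3-4.

Basically, if I read the CSV with nrows=4, I get the first 4 rows, while chunking the same file with chunksize=2 gives me rows 3 and 4 first, then rows 5 and 6, ...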

#1. Read with nrows
import pandas as pd

#read the first 4 rows of the csv file and merge the date and time columns to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], nrows=4)

print (reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file with chunks
#iterate over csv file in chunks and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)

for chunk in reader:

    #create a dataframe from chunks
    df = reader.get_chunk()
    print (df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

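Increasing the chunk size to 10 skips the first 10 rows.

Is there any way to fix this? I already have a workaround that works, but I would like to understand what I am doing wrong.

Any input is welcome!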

Answer 1

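Don't call get_chunk. You already have your chunk, because you are iterating over reader, i.e. chunk is your DataFrame. Each pass through the for loop pulls one chunk from the reader, and the extra reader.get_chunk() call then pulls the next one, so the chunk delivered by the loop itself is never used. Call print(chunk) in your loop and you should see the expected output.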
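A minimal sketch of the difference, assuming a small in-memory CSV in place of 'filename.csv' (the sample rows, the column names, and the variable names correct_chunks and skipped_chunks are made up for illustration, and the date parsing from the question is left out to keep it short):

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'filename.csv'.
csv_text = """id,date,time,value
A,01/01/2016,09:30,100
A,01/01/2016,13:30,110
A,01/01/2016,15:30,120
A,02/01/2016,10:30,115
"""

# Correct: the loop variable is already the DataFrame for each chunk.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
correct_chunks = []
for chunk in reader:
    correct_chunks.append(chunk)
    print(chunk)

# Pattern from the question: the for loop consumes one chunk and
# get_chunk() consumes the following one, so the first chunk is never used.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
skipped_chunks = []
for chunk in reader:
    skipped_chunks.append(reader.get_chunk())
```

With four rows and chunksize=2, the first loop collects two chunks, while the second collects only the chunk holding rows 3-4, which matches the output shown in the question.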

As @MaxU points out in the comments, you would use get_chunk if you want differently sized chunks: reader.get_chunk(500), reader.get_chunk(100), etc.
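As a rough illustration of reading differently sized chunks, the sketch below opens a reader with iterator=True and calls get_chunk with explicit sizes. The ten-row sample data and the variable names are assumptions for this example only:

```python
import io

import pandas as pd

# Hypothetical ten-row CSV with a single 'value' column (0 through 9).
csv_text = "value\n" + "\n".join(str(n) for n in range(10)) + "\n"

# iterator=True returns a reader object without fixing a chunk size.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)   # rows 1-3
second = reader.get_chunk(5)  # rows 4-8
third = reader.get_chunk(2)   # rows 9-10

print(first)
print(second)
print(third)
```

Each call continues where the previous one stopped, so no for loop over reader is needed here, and mixing the two styles is what caused the skipped rows in the question.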
