How to keep index numbers consistent across pandas DataFrame chunks

I am reading a large file with pandas, so I read it in chunks:

for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    ...  # process each chunk here

I need to add a 'seqnum' column to each chunk that gives a consistent index across all chunks:

for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values

So for the first chunk, df_small['seqnum'] will be:

0
1
2
...
999

But for the second chunk, df_small['seqnum'] is still:

0
1
2
...
999

That is not what I want. Ideally, df_small['seqnum'] for the second chunk would be:

1000
1001
1002
...
1999

Is there a way to do this?

Just create a variable that keeps track of the starting index for the next chunk, like this:

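import pandas as pd

seq_num = 0  # sequence number for the first row of the next chunk
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # Number rows by position; adding df_small.index would double-count
    # if the chunk index already continues from the previous chunk.
    df_small['seqnum'] = range(seq_num, seq_num + len(df_small))
    seq_num += len(df_small)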
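
If you want to try the running-counter idea without the real file, here is a minimal sketch. The helper name number_chunks and the in-memory CSV are my own stand-ins (not from the question), and the chunk size is shrunk to 4 so the result is easy to check by eye.

```python
import io
import pandas as pd

def number_chunks(chunks, start=0):
    """Yield each chunk with a 'seqnum' column that keeps counting across chunks."""
    next_seq = start
    for chunk in chunks:
        chunk = chunk.copy()  # leave the reader's frame untouched
        # Positional numbering: does not depend on what the chunk's index holds.
        chunk["seqnum"] = range(next_seq, next_seq + len(chunk))
        next_seq += len(chunk)
        yield chunk

# Hypothetical stand-in for largefile.txt: one 'Name' column, ten rows A..J.
csv_text = "Name\n" + "\n".join("ABCDEFGHIJ")
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

numbered = list(number_chunks(reader))
seqnums = [int(n) for chunk in numbered for n in chunk["seqnum"]]
print(seqnums)
```

Because the counter only uses len(chunk), it should behave the same whether or not the chunk index restarts, and it also copes with a shorter final chunk.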

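Use the index of df_small. As the output below shows, read_csv keeps the row index running from one chunk to the next, so it already gives consistent numbering: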

for df_small in pd.read_csv("data1.csv", chunksize=3,
                             iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
    print(df_small)

Output:

  Name  seqnum  # <- 1st iteration
0    A       0
1    B       1
2    C       2

  Name  seqnum  # <- 2nd iteration
3    D       3
4    E       4
5    F       5

  Name  seqnum  # <- 3rd iteration
6    G       6
7    H       7
8    I       8

   Name  seqnum  # <- 4th iteration
9     J       9
10    K      10
11    L      11
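
One caveat worth checking before relying on the index: this only works while the chunks carry the default integer index. A small sketch of the difference, using a made-up in-memory CSV with a hypothetical Score column and index_col to force a label index:

```python
import io
import pandas as pd

# Hypothetical sample data, not the file from the question.
csv_text = "Name,Score\nA,1\nB,2\nC,3\nD,4\nE,5\n"

# Default index: row numbers keep counting from chunk to chunk.
default_starts = [
    int(chunk.index[0])
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# With index_col, the index holds that column's labels instead of row numbers.
label_starts = [
    chunk.index[0]
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, index_col="Name")
]

print(default_starts)
print(label_starts)
```

If the index is made of labels like this, copying it into 'seqnum' no longer gives row numbers, and a running counter is the safer choice.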