How to keep consistent index numbers across different pandas DataFrame chunks

Question

I'm reading a large file with pandas, so I use:

for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                         iterator=True, low_memory=False):

and I need to add a 'seqnum' column to each chunk that gives a consistent index across all chunks:

for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):

    df_small['seqnum'] = df_small.index.values

So for the first chunk, df_small['seqnum'] will be:

0
1
2
...
999

But for the second chunk, df_small['seqnum'] is still:

0
1
2
...
999

This is not what I want. Ideally, df_small['seqnum'] for the second chunk should be:

1000
1001
1002
...
1999

Is there a way to do this?

Answer 1

Just create a variable that keeps track of the starting index of the next chunk, like this:

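import pandas as pd

seq_num = 0  # sequence number of the first row in the next chunk
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # Number rows by position rather than by df_small.index, which
    # may already continue across chunks or may not be numeric at all
    df_small['seqnum'] = range(seq_num, seq_num + len(df_small))
    seq_num += len(df_small)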
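
For anyone who wants to try the running-counter idea without a large file, here is a minimal sketch. It assumes a small in-memory CSV in place of largefile.txt; the `Name` column and the `add_seqnum` helper are made up for illustration. Rows are numbered by position, so the result should not depend on what index each chunk carries:

```python
import io

import pandas as pd


def add_seqnum(chunks):
    """Yield each chunk with a 'seqnum' column that keeps counting across chunks."""
    seq_num = 0  # sequence number for the first row of the next chunk
    for chunk in chunks:
        chunk = chunk.copy()
        chunk["seqnum"] = range(seq_num, seq_num + len(chunk))
        seq_num += len(chunk)
        yield chunk


# Hypothetical sample data standing in for the large file: 8 rows, A..H
csv_text = "Name\n" + "\n".join("ABCDEFGH")
reader = pd.read_csv(io.StringIO(csv_text), chunksize=3)

result = pd.concat(add_seqnum(reader))
print(result["seqnum"].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The last chunk here has only two rows, which is why the counter advances by `len(chunk)` rather than by the chunk size.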

Answer 2

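Use the index of df_small. With the default index, read_csv keeps numbering rows across chunks, so the index already continues from one chunk to the next: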

for df_small in pd.read_csv("data1.csv", chunksize=3,
                             iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
    print(df_small)

Output:

  Name  seqnum  # <- 1st iteration
0    A       0
1    B       1
2    C       2

  Name  seqnum  # <- 2nd iteration
3    D       3
4    E       4
5    F       5

  Name  seqnum  # <- 3rd iteration
6    G       6
7    H       7
8    I       8

   Name  seqnum  # <- 4th iteration
9     J       9
10    K      10
11    L      11
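
To check this behaviour without data1.csv, the following sketch may help. It assumes an in-memory CSV with a single `Name` column holding the twelve rows shown above (the sample data and the `first_labels` list are only for illustration) and records the first seqnum of every chunk:

```python
import io

import pandas as pd

# Hypothetical stand-in for data1.csv: 12 rows, A..L
csv_text = "Name\n" + "\n".join("ABCDEFGHIJKL")

first_labels = []
for df_small in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # The default index continues across chunks, so it can be copied directly
    df_small["seqnum"] = df_small.index.values
    first_labels.append(int(df_small["seqnum"].iloc[0]))

print(first_labels)  # [0, 3, 6, 9]
```

This relies on the default index. If index_col is passed to read_csv, the index holds that column's values instead of row numbers, and a running counter is probably the safer choice.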
