How to keep a consistent index number across different pandas DataFrame chunks
I am reading a large file with pandas, so I read it in chunks:
import pandas as pd

for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    ...  # process each chunk
I need to add a column 'seqnum' to each chunk that gives a consistent row number across all chunks:
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
So for the first chunk, df_small['seqnum'] will be:
0
1
2
...
999
But for the second chunk, df_small['seqnum'] is still:
0
1
2
...
999
That is not what I want. Ideally, df_small['seqnum'] for the second chunk would be:
1000
1001
1002
...
1999
Is there a way to do this?
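Just create a variable that tracks the starting number for the next chunk, like this:
seq_num = 0
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # Number rows by position, so the result does not depend on the chunk's index
    df_small['seqnum'] = range(seq_num, seq_num + len(df_small))
    # The next chunk starts where this one ended
    seq_num += len(df_small)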
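If the chunks carry a non-numeric index (or one that already continues across chunks), adding an offset to the index may not give the numbering you expect. Below is a minimal sketch that numbers rows purely by position. Note that `add_seqnum` is a hypothetical helper name, and the small in-memory CSV with `Name` and `Value` columns is made-up sample data, not the asker's file.

```python
import io

import numpy as np
import pandas as pd


def add_seqnum(chunks, column="seqnum"):
    """Yield each chunk with a running row number that continues across chunks."""
    next_seq = 0
    for chunk in chunks:
        chunk = chunk.copy()
        # Position-based numbering: independent of whatever index the chunk has
        chunk[column] = np.arange(next_seq, next_seq + len(chunk))
        next_seq += len(chunk)
        yield chunk


# Made-up sample data: 7 rows, read 3 at a time, with a string index
rows = [f"{name},{i * 10}" for i, name in enumerate("ABCDEFG")]
csv_text = "Name,Value\n" + "\n".join(rows) + "\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=3, index_col="Name")
numbered = list(add_seqnum(reader))

for chunk in numbered:
    print(chunk)
```

Because the counter only depends on `len(chunk)`, it should keep working when `index_col` is set or when the last chunk is shorter than `chunksize`.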
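Use the index of df_small. With the default index, each chunk's index continues from where the previous chunk ended, so it can be copied directly:
import pandas as pd

for df_small in pd.read_csv("data1.csv", chunksize=3,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
    print(df_small)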
Output:
  Name  seqnum   # <- 1st iteration
0    A       0
1    B       1
2    C       2
  Name  seqnum   # <- 2nd iteration
3    D       3
4    E       4
5    F       5
  Name  seqnum   # <- 3rd iteration
6    G       6
7    H       7
8    I       8
   Name  seqnum  # <- 4th iteration
9     J       9
10    K      10
11    L      11
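To check this behaviour on your own pandas version before relying on it, here is a small sketch that records the first and last index label of every chunk. `chunk_index_ranges` is a hypothetical helper name, and the 12-row in-memory CSV is an assumed stand-in for data1.csv.

```python
import io

import pandas as pd


def chunk_index_ranges(csv_text, chunksize):
    """Return (first_label, last_label) of the default index for each chunk."""
    ranges = []
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=chunksize):
        ranges.append((int(chunk.index[0]), int(chunk.index[-1])))
    return ranges


# Assumed stand-in for data1.csv: one 'Name' column with rows A..L
csv_text = "Name\n" + "\n".join("ABCDEFGHIJKL") + "\n"

ranges = chunk_index_ranges(csv_text, chunksize=3)
print(ranges)
```

If the ranges run on from one chunk to the next, as in the output above, the index can be used as the sequence number. This only holds for the default integer index; with `index_col` set, a separate counter is the safer choice.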