Hash table mapping in Pandas

Question

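I have a large dataset with millions of rows. One of the columns is ID.

I also have another (hash) table that maps index ranges to specific groups that satisfy certain conditions.

What is an efficient way to map these index ranges and add them to the dataset as an additional column in pandas?

For example, suppose the dataset looks like this: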

In [18]:
print(df_test)

Out [19]:
    ID
0   13
1   14
2   15
3   16
4   17
5   18
6   19
7   20
8   21
9   22
10  23
11  24
12  25
13  26
14  27
15  28
16  29
17  30
18  31
19  32

The hash table with the index ranges looks like this:

In [20]:
print(df_hash)

Out [21]:
   ID_first
0         0
1         2
2        10

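Here the index gives the group number I need, and ID_first is the row index in df_test at which that group starts.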
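For anyone who wants to reproduce the example, the two frames printed above can be rebuilt roughly as follows. This is a minimal sketch that assumes both frames use the default integer index, as the printed output suggests.

```python
import pandas as pd

# 20 rows with IDs 13..32 and the default 0..19 index.
df_test = pd.DataFrame({"ID": range(13, 33)})

# The index of df_hash is the group number; ID_first is the
# df_test row label at which that group begins.
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

print(df_test.head())
print(df_hash)
```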

I tried this:

for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except:
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index

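This works fine, except that it is really slow because it loops over the full length of the hash table DataFrame (hundreds of thousands of rows). It produces the following result, which is what I want: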

In [23]:
print(df_test)

Out [24]:
    ID  Group
0   13    0
1   14    0
2   15    1
3   16    1
4   17    1
5   18    1
6   19    1
7   20    1
8   21    1
9   22    1
10  23    2
11  24    2
12  25    2
13  26    2
14  27    2
15  28    2
16  29    2
17  30    2
18  31    2
19  32    2

Is there a more efficient way to do this?

Answer 1

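You can combine Series.isin with Series.cumsum. isin flags the rows whose ID equals one of the range starts, and cumsum turns those flags into a running group counter. Note that this matches on the values in the ID column, so the output below uses a df_test whose ID runs from 0 to 19 rather than the 13 to 32 shown in the question.

df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum()  # append .sub(1) for zero-based groups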

print(df_test)

    ID  group
0    0      1
1    1      1
2    2      2
3    3      2
4    4      2
5    5      2
6    6      2
7    7      2
8    8      2
9    9      2
10  10      3
11  11      3
12  12      3
13  13      3
14  14      3
15  15      3
16  16      3
17  17      3
18  18      3
19  19      3
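With the question's data the range starts refer to row labels, not ID values, so the same isin and cumsum idea would have to be applied to the index instead. A minimal sketch, assuming the default 0..19 index and the sample frames from the question:

```python
import pandas as pd

df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# Boolean array: True at every row label that starts a new range.
is_start = df_test.index.isin(df_hash["ID_first"])

# Running count of starts seen so far; subtract 1 for zero-based groups.
df_test["group"] = is_start.cumsum() - 1

print(df_test)
```

This relies on every ID_first value actually appearing in the index of df_test and on the first range starting at the first row.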

Answer 2

You can map the index of df_test to the index of df_hash by way of ID_first, and then forward-fill with ffill. A Series has to be constructed because the pd.Index class has no ffill method.

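# Row labels that start a range map to their group number (NaN elsewhere), then ffill.
# ffill's downcast keyword is deprecated in recent pandas, so cast back to int explicitly.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                      .ffill()
                      .astype(int))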

#    ID  group
#0   13      0
#1   14      0
#2   15      1
#...
#9   22      1
#10  23      2
#...
#17  30      2
#18  31      2
#19  32      2
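For comparison, here is a sketch of a different technique that neither answer uses: a binary search with numpy.searchsorted. It assumes that ID_first is sorted in ascending order and that the first range starts at or before the smallest row label; any label below the first start would come out as -1.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

starts = df_hash["ID_first"].to_numpy()   # must be sorted ascending
labels = df_test.index.to_numpy()

# side="right" counts how many starts are <= each label;
# subtracting 1 turns that count into the position of the enclosing range.
pos = np.searchsorted(starts, labels, side="right") - 1

# Translate positions into df_hash index values (the group numbers).
df_test["group"] = df_hash.index.to_numpy()[pos]

print(df_test)
```

Unlike the isin and map approaches, this does not require each range start to be present in the index of df_test.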
