Pandas dataframe join groupby speed up

I am adding some columns to a dataframe based on groupings of other columns: I do some grouping and counting, and finally join the result back to the original dataframe.

The full data has 1M rows; I first tried the approach on 20k rows and it works fine. The data has one entry for each item a customer adds to an order.

Here is some sample data:

import numpy as np
import pandas as pd

data = np.array([[101, 201, 301], [101, 201, 302], [101, 201, 303],
                 [101, 202, 301], [101, 202, 302], [101, 203, 301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id', 'item_id'])
# Count items per customer, then join the counts back onto each row.
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()
    ['order_id'], on='customer_id', rsuffix="_x")['order_id_x']
# Count items per order, then join the counts back onto each row.
df['nitems_in_order'] = df.join(df.groupby('order_id').count()
    ['customer_id'], on='order_id', rsuffix="_x")['customer_id_x']

For the sample data above, the desired output is:

| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
|-------------|----------|---------|----------------------------|-----------------|
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |

The snippet that runs relatively fast even for 1M rows is:

df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()
    ['order_id'], on='customer_id', rsuffix="_x")['order_id_x']

But the analogous join takes quite a long time, on the order of hours:

df['nitems_in_order'] = df.join(df.groupby('order_id').count()
    ['customer_id'], on='order_id', rsuffix="_x")['customer_id_x']
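
For anyone trying to reproduce the slowdown, here is a minimal sketch of synthetic data at a similar scale (the group counts below are assumptions, not my real distribution):

# Synthetic stand-in for the real data: one row per (customer, order, item).
# order_id has far more distinct groups than customer_id, which is what
# makes the second groupby/join expensive. Sizes are assumptions.
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000000
big = pd.DataFrame({
    'customer_id': np.random.randint(0, 50000, n),    # assumed ~50k customers
    'order_id': np.random.randint(0, 400000, n),      # assumed ~400k orders
    'item_id': np.random.randint(0, 1000, n),
})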

I am hoping there is a smarter way to get the same aggregated values. I understand why the second case takes so long: the number of groups is much larger. Thanks.

OK, I can see what you are trying to achieve, and on this sample size it is over 2x faster; I think it may scale better too. Basically, instead of joining/merging the result of your groupby back onto your original df, just call transform:

In [24]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1
In [26]:


%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1

Interestingly, when I tried this on a 600,000 row df:

In [34]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:

%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop

So my first method is about 25% faster here, but the second one is actually slower than yours; I think it is worth trying this on your real data to see whether it yields any speed improvement.

If we combine the creation of the columns so that it is done in a single line:

In [40]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'),  df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop

We can see that my combined code is a little faster than yours, so doing it this way does not save much. Normally you could apply multiple aggregation functions and return multiple columns from a single groupby, but the problem here is that you are grouping on different columns, so we have to perform two expensive groupby operations (see the sketch below for the shared-column case).
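
For completeness, a minimal sketch of the single-groupby, multi-aggregation pattern when the results do share a grouping column (the nunique aggregation is my own illustration, not something asked for in the question):

# One groupby, several aggregations -- only possible here because both
# results are grouped by the same column ('customer_id').
agg_df = df.groupby('customer_id')['order_id'].agg(['count', 'nunique'])
# 'count'   -> total items per customer (total_nitems_user_lifetime)
# 'nunique' -> number of distinct orders per customer
df_multi = df.join(agg_df, on='customer_id')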

Original approach, 1M rows:

df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
                       ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
time:  0:00:02.422288

@EdChum's transform suggestion:

df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601

Using groupby, then selecting a single column, counting it, converting back to a dataframe, and finally joining. Result: much faster:

df = df.join(df.groupby(['order_id'])['order_id'].count().to_frame('nitems_in_order'),on='order_id')
time: 0:00:0.406383
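
A further variant that may be worth benchmarking (my suggestion, not timed above): value_counts plus map produces the same per-order count without the join:

# Count rows per order_id once, then look each row's order_id up in the
# resulting Series; gives the same column as the join above.
order_counts = df['order_id'].value_counts()
df['nitems_in_order'] = df['order_id'].map(order_counts)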

Thanks.