Pandas DataFrame join groupby speed up
I am adding some columns to a DataFrame based on groupings of other columns: I do a groupby, count, and finally join the result back onto the original DataFrame.
The full data has 1M rows. I first tried the approach on 20k rows and it works fine. The data has one entry for each item a customer adds to an order.
Here is some sample data:
import numpy as np
import pandas as pd
data = np.matrix([[101,201,301],[101,201,302],[101,201,303],[101,202,301],[101,202,302],[101,203,301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id','item_id'])
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
For the sample data above, the desired output is:
| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
|-------------|----------|---------|----------------------------|-----------------|
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |
The snippet that runs relatively fast even on 1M rows is:
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
But the analogous join takes quite a long time, on the order of a few hours:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
I am hoping there is a smarter way to get the same aggregate values. I understand why the second case takes so long: the number of groups is much larger. Thanks.
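A quick way to see where the cost difference comes from is to compare group cardinalities. A minimal sketch on synthetic data (the row count and id ranges here are made-up assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic data shaped like the problem: each customer places several
# orders, so there are far more distinct order_ids than customer_ids.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'customer_id': rng.integers(0, 5_000, n),
    'order_id': rng.integers(0, 50_000, n),
    'item_id': rng.integers(0, 1_000, n),
})

# The second groupby has roughly 10x as many groups to build and align,
# which is where the extra time goes.
print(df['customer_id'].nunique())
print(df['order_id'].nunique())
```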
OK, I can see what you're trying to achieve, and on this sample size it's over 2x faster. I think it may also scale better. Basically, instead of joining/merging the result of your groupby back onto your original df, just call transform:
In [24]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
In [26]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
Interestingly, when I tried this on a 600,000-row df:
In [34]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop
My first transform call is about 25% faster, but the second is actually slower than your method, so I think it's worth trying this on your real data to see whether it yields any speed improvement.
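The scaling comparison above can be reproduced with a rough timing harness. A sketch on randomly generated data (the sizes and id ranges are assumptions, not the asker's real 600k-row dataset), which also checks that both routes agree:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in data with many more orders than customers.
rng = np.random.default_rng(42)
n = 200_000
df = pd.DataFrame({
    'customer_id': rng.integers(0, 10_000, n),
    'order_id': rng.integers(0, 100_000, n),
})

# Per-row order sizes via transform (single groupby, no join).
t0 = time.perf_counter()
via_transform = df.groupby('order_id')['customer_id'].transform('count')
t1 = time.perf_counter()

# The same counts via the original count-then-join route.
via_join = df.join(df.groupby('order_id').count()['customer_id'],
                   on='order_id', rsuffix='_x')['customer_id_x']
t2 = time.perf_counter()

# Both routes must agree row for row.
assert (via_transform.values == via_join.values).all()
print(f"transform: {t1 - t0:.3f}s  join: {t2 - t1:.3f}s")
```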
If we combine the column creation so it happens in a single line:
In [40]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'), df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop
We can see that my combined code is a bit faster than yours, so doing it this way doesn't save much. Normally you could apply multiple aggregation functions so you can return multiple columns, but the problem here is that you're grouping on different columns, so we have to perform 2 expensive groupby operations.
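To illustrate that last point: when both statistics share the same grouping key, a single groupby can return several aggregate columns at once via named aggregation. A small sketch on the question's sample data (the `distinct_items` column is just an illustrative extra, not part of the original problem):

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({'customer_id': [101] * 6,
                   'order_id': [201, 201, 201, 202, 202, 203],
                   'item_id': [301, 302, 303, 301, 302, 301]})

# One groupby over a single key yields several aggregate columns at once.
per_order = df.groupby('order_id').agg(
    nitems=('item_id', 'count'),
    distinct_items=('item_id', 'nunique'),
)
print(per_order)
#           nitems  distinct_items
# order_id
# 201            3               3
# 202            2               2
# 203            1               1
```

This only works because both aggregates share the key `order_id`; the question needs one count per `customer_id` and one per `order_id`, which is why two separate groupbys are unavoidable.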
Original method, on 1 million rows:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
time: 0:00:02.422288
@EdChum's transform suggestion:
df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601
Using groupby, then selecting one column, then counting, converting back to a DataFrame, and finally joining. Result: much faster:
df = df.join(df.groupby(['order_id'])['order_id'].count().to_frame('nitems_in_order'),on='order_id')
time: 0:00:00.406383
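As a quick sanity check, this count/to_frame/join variant can be verified against transform on the question's sample data:

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame([[101, 201, 301], [101, 201, 302], [101, 201, 303],
                   [101, 202, 301], [101, 202, 302], [101, 203, 301]],
                  columns=['customer_id', 'order_id', 'item_id'])

# count -> to_frame -> join, as in the fast variant above.
joined = df.join(df.groupby(['order_id'])['order_id'].count()
                   .to_frame('nitems_in_order'), on='order_id')

# transform produces the same per-row counts.
expected = df.groupby('order_id')['customer_id'].transform('count')
assert (joined['nitems_in_order'].values == expected.values).all()
print(joined['nitems_in_order'].tolist())  # [3, 3, 3, 2, 2, 1]
```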
Thanks.