Dask replicate Pandas value counts on Groupby

Question

What I'm trying to do is replicate pandas' value_counts + idxmax functions in Dask, since I have a lot of data. Here is an example dataframe:

partner_num cust_id item_id revw_ratg_num   revw_dt item_qty
0   100 01  5   05/30/2000  0
0   100 03  5   05/30/2000  0
0   100 02  5   05/30/2000  0
1   200 13  4   04/01/2000  0
1   200 14  5   04/01/2000  1
2   200 22  2   04/01/2000  1
3   200 37  3   04/01/2000  1
9   300 92  1   03/24/2000  1
9   300 93  1   03/24/2000  1
9   300 94  1   03/24/2000  0
9   300 99  1   03/24/2000  0
6   300 91  2   03/24/2000  0

>>> df.head()
   partner_num  cust_id  item_id  revw_ratg_num     revw_dt  item_qty
0            0      100        1              5  05/30/2000         0
1            0      100        3              5  05/30/2000         0
2            0      100        2              5  05/30/2000         0
3            1      200       13              4  04/01/2000         0
4            1      200       14              5  04/01/2000         1

In pandas you can do this:

import pandas as pd

df = pd.read_csv("fake_data.txt", sep="\t")
df.groupby(["cust_id"]).item_qty.value_counts()

cust_id  item_qty
100      0           3
200      1           3
         0           1
300      0           3
         1           2
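
The question mentions idxmax but only shows the value_counts step. As a minimal sketch of one possible reading (picking the most frequent item_qty for each cust_id), the counts can be pivoted with unstack and then reduced with idxmax. The inline DataFrame copies two columns of the sample data above; the names counts and most_common are illustrative and do not come from the original post.

```python
import pandas as pd

# Inline copy of the cust_id and item_qty columns from the sample data.
df = pd.DataFrame({
    "cust_id":  [100, 100, 100, 200, 200, 200, 200, 300, 300, 300, 300, 300],
    "item_qty": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
})

# Count each item_qty value within each cust_id group.
counts = df.groupby("cust_id")["item_qty"].value_counts()

# Pivot item_qty into columns (missing combinations become 0), then pick,
# for each cust_id row, the item_qty column holding the largest count.
most_common = counts.unstack(fill_value=0).idxmax(axis=1)
print(most_common)
```

If two values tie within a group, idxmax keeps the first one in column order, so ties may need separate handling.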

However, when you do the same thing in Dask, it fails with an AttributeError:

import dask.dataframe as dd

df1 = dd.read_csv("fake_data.txt", sep="\t")
df1.groupby(["cust_id"]).item_qty.value_counts()

Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    df1.groupby(["cust_id"]).item_qty.value_counts()
AttributeError: 'SeriesGroupBy' object has no attribute 'value_counts'

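What I would really like is to get, in Dask, both the values and their number of occurrences after a multi-column groupby. Any alternative solution is acceptable; I just want to get the job done!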

Answer 1

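value_counts is not directly supported on groupby objects by the dataframe API in Dask. Use apply to get the result you want.

Note that value_counts is supported as a Series method.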

>>> df1.groupby(['cust_id']).item_qty.apply(lambda x: x.value_counts()).compute()
cust_id   
100      0    3
200      1    3
         0    1
300      0    3
         1    2
Name: item_qty, dtype: int64

