Is there a way to get the nlargest items per group in dask?
I have the following dataset:
location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0
          4          13.0
          5           4.0
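I am trying to get the largest items per location group in the dataframe. For example, if I want the top 2 percentages in each group, the output should be:
location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0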
This looks relatively straightforward in pandas using pandas.core.groupby.SeriesGroupBy.nlargest, but dask does not have an nlargest function for groupby. I have been trying apply, but can't seem to get it to work.
df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()
But I just get the error ValueError: Wrong number of items passed 0, placement implies 8
apply should work, but your syntax is slightly off:
In [11]: df
Out[11]:
Dask DataFrame Structure:
Unnamed: 0 location category percent
npartitions=1
int64 object int64 float64
... ... ... ...
Dask Name: from-delayed, 3 tasks
In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A 0 100.0
B 1 100.0
C 2 50.0
3 13.0
D 4 75.0
5 59.0
Name: x, dtype: float64
In pandas, .nlargest and .rank are available as groupby methods, which lets you do this without apply:
In [21]: df1
Out[21]:
location category percent
0 A 5 100.0
1 B 3 100.0
2 C 2 50.0
3 C 4 13.0
4 D 2 75.0
5 D 3 59.0
6 D 4 13.0
7 D 5 4.0
In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A 0 100.0
B 1 100.0
C 2 50.0
3 13.0
D 4 75.0
5 59.0
Name: percent, dtype: float64
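The .rank route mentioned above is not shown in the session, so here is a small pandas-only sketch of it. It rebuilds df1 by hand; ranks and top2_rows are illustrative names, and method="first" is one assumed way of breaking ties (by row order). One practical difference from .nlargest is that filtering on the rank keeps the other columns, such as category, which matches the output asked for in the question.

```python
import pandas as pd

# Same data as df1 in the session above, rebuilt by hand.
df1 = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})

# Rank percent within each location; the largest value gets rank 1.
# method="first" breaks ties by order of appearance.
ranks = df1.groupby("location")["percent"].rank(method="first", ascending=False)

# Keep whole rows whose within-group rank is 1 or 2.
top2_rows = df1[ranks <= 2]

print(top2_rows)
```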
Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:
- The pandas API is huge
- Some operations are genuinely hard to do in parallel (for example sort).