pandas: use cumsum on a column and create a new boolean column that marks the edge case as True

Question

I have the following df:

year_month    pct
201903        50
201903        40
201903         5
201903         5
201904        90
201904         5
201904         5

I would like to create a boolean column called non-tail that satisfies the following condition:

df.sort_values(['pct'], ascending=False).groupby('year_month')['pct'].apply(lambda x: x.cumsum().le(80))

In addition, the next value in pct, the one whose addition first pushes the cumsum above 80, should also be marked True in non-tail, so the result looks like this:

 year_month    pct    non-tail
 201903        50     True
 201903        40     True
 201903         5     False
 201903         5     False
 201904        90     True
 201904         5     False
 201904         5     False
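
To make the gap between the condition and the desired output concrete, here is a minimal sketch that rebuilds the sample frame and evaluates the plain `cumsum().le(80)` condition on its own. The variable names `ordered`, `running` and `plain` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "year_month": [201903, 201903, 201903, 201903, 201904, 201904, 201904],
    "pct": [50, 40, 5, 5, 90, 5, 5],
})

# Largest pct first within each month, then a running total per month.
ordered = df.sort_values(["year_month", "pct"], ascending=[True, False])
running = ordered.groupby("year_month")["pct"].cumsum()

# The plain condition from the question: running total still at or below 80.
plain = running.le(80)

print(pd.DataFrame({"pct": ordered["pct"], "cumsum": running, "cumsum<=80": plain}))
```

With this data the plain condition is True only for the 50 row of 201903 and for no row of 201904, so the row that crosses the threshold (40 and 90 respectively) still has to be flagged separately.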

Answer 1

IIUC, you need to shift the result of the cumsum comparison by one row within each group:

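# sort so that the largest pct comes first within each month
df = df.sort_values(['year_month','pct'], ascending=[True,False])
# group_keys=False keeps the original index on pandas versions
# where apply would otherwise prepend the group key
(df.groupby('year_month', group_keys=False)['pct']
   .apply(lambda x: x.cumsum().le(80)
                     .shift(fill_value=True)
         )
)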

which gives you:

0     True
1     True
2    False
3    False
4     True
5    False
6    False
Name: pct, dtype: bool
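
As a self-contained sketch of the shift idea, the snippet below assigns the result back to the frame as `non-tail`. It uses the built-in groupby `cumsum` and `shift` rather than `apply`, which is a variation on the code above and not the only way to write it; `running` and `within_limit` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "year_month": [201903, 201903, 201903, 201903, 201904, 201904, 201904],
    "pct": [50, 40, 5, 5, 90, 5, 5],
})

df = df.sort_values(["year_month", "pct"], ascending=[True, False])

# Running total per month; groupby cumsum keeps the original row index.
running = df.groupby("year_month")["pct"].cumsum()
within_limit = running.le(80)

# Shift the flags down one row inside each month, so every row sees the state
# before its own value was added. The first row of a month is filled with True.
df["non-tail"] = (
    within_limit.groupby(df["year_month"]).shift(fill_value=True).astype(bool)
)

# Cross-check: the shifted flag is the same as asking whether the running
# total *before* the current row (cumsum minus own value) is at most 80.
matches = (df["non-tail"] == (running - df["pct"]).le(80)).all()

print(df)
```

The shift is what turns "the total including this row is at most 80" into "the total before this row is at most 80", which is why the row that crosses the threshold ends up True.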

Answer 2

Here is what I would do:

df.pct.iloc[::-1].groupby(df['year_month']).cumsum()>20
Out[306]: 
6    False
5    False
4     True
3    False
2    False
1     True
0     True
Name: pct, dtype: bool
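
A runnable sketch of this reverse-cumsum approach follows. It rests on two assumptions that hold for the sample data but are not stated in the answer: the rows are already sorted by pct in descending order within each month, and each month's pct values sum to 100. `reverse_running` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "year_month": [201903, 201903, 201903, 201903, 201904, 201904, 201904],
    "pct": [50, 40, 5, 5, 90, 5, 5],
})

# Assumption: df is sorted by pct descending within each month and every
# month sums to 100. Walking the rows backwards, the running total is then
# 100 minus the total of all rows that come before the current one.
reverse_running = df["pct"].iloc[::-1].groupby(df["year_month"]).cumsum()

# Column assignment aligns on the index, so the reversed row order of
# reverse_running does not matter here.
df["non-tail"] = reverse_running > 20

print(df)
```

Under those assumptions, `> 20` is equivalent to "the total before this row is below 80". That differs from the shifted `le(80)` version only when the running total lands exactly on 80; using `>= 20` instead would match that case as well.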
