Cumulative sum of the first run of consecutive True values in a group in Pandas

Question

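I have a Pandas DataFrame with columns A, B, C, and D. I would like to add a desired column as described below:

Grouping by ['A', 'B', 'C'], I want the desired column to show the cumulative sum of the FIRST CONSECUTIVE True values in column D.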

A    B    C    D      Desired Column
100  AAA  001  False  0
100  AAA  001  False  0
200  BBB  055  True   1
200  BBB  055  True   2
200  BBB  055  True   3
200  BBB  055  False  3
200  BBB  055  True   3
300  CCC  099  False  0
300  CCC  099  True   0

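A False value stops the cumulative sum within a group, and any True values after that False are ignored.

I would then like to use this table to compute an aggregate: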

A    B    C    Max(Desired Column)
100  AAA  001  0
200  BBB  055  3
300  CCC  099  0

Thanks for your help!

Answer 1

You can use cummin to mark every value after the first False as False, and then compute the cumsum:

df['Desired Column'] = df.groupby(['A', 'B', 'C']).D.transform(lambda x: x.cummin().cumsum())

df
     A    B   C      D  Desired Column
0  100  AAA   1  False               0
1  100  AAA   1  False               0
2  200  BBB  55   True               1
3  200  BBB  55   True               2
4  200  BBB  55   True               3
5  200  BBB  55  False               3
6  200  BBB  55   True               3
7  300  CCC  99  False               0
8  300  CCC  99   True               0
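To see why the cummin/cumsum combination works, here is a minimal sketch on a single group (the 200/BBB/055 rows from the question). The variable names are illustrative only and are not part of the original answer.

```python
import pandas as pd

# D values of one group, copied from the 200/BBB/055 rows in the question.
d = pd.Series([True, True, True, False, True])

# cummin stays True only until the first False is seen, then stays False.
mask = d.cummin()

# Summing the mask cumulatively counts only the leading run of True values.
running = mask.cumsum()

print(mask.tolist())     # [True, True, True, False, False]
print(running.tolist())  # [1, 2, 3, 3, 3]
```

Because the mask can never return to True once it has dropped to False, the True in the last row contributes nothing to the running total.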

If you only need the aggregated column, you can use argmin to find the position of the first False:

df.groupby(['A', 'B', 'C'], as_index=False).D.agg(
  lambda x: len(x) if x.all() else x.argmin()
)

     A    B   C  D
0  100  AAA   1  0
1  200  BBB  55  3
2  300  CCC  99  0
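As an alternative sketch of the same aggregation, the leading run can also be counted by summing the cummin mask, which avoids the special case for all-True groups. The helper name leading_true_count and the output column name leading_true are hypothetical and do not come from the original answer.

```python
import pandas as pd

# Sample data matching the question (C shown as integers, as in the answer).
df = pd.DataFrame({
    "A": [100, 100, 200, 200, 200, 200, 200, 300, 300],
    "B": ["AAA"] * 2 + ["BBB"] * 5 + ["CCC"] * 2,
    "C": [1, 1, 55, 55, 55, 55, 55, 99, 99],
    "D": [False, False, True, True, True, False, True, False, True],
})


def leading_true_count(x):
    # Hypothetical helper: number of True values before the first False.
    return int(x.cummin().sum())


result = (
    df.groupby(["A", "B", "C"])["D"]
      .agg(leading_true_count)
      .reset_index(name="leading_true")
)
print(result)
```

For an all-True group the mask is all True, so the sum equals the group length, which is the same value the len(x) branch returns in the argmin version.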

Answer 2

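I added a group to your sample data to cover the case where a group starts with a single, non-consecutive True that is followed by a False.

expanding().min() does the same as cummin, and min_periods controls how many rows must be seen before a value is produced. bfill then fills the resulting NaN in the first row of each group with the next value.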

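# transform keeps the original row index, so the result aligns with df on assignment
# (in recent pandas versions, apply may prepend the group keys to the index)
df['actual'] = (df.groupby(['A','B','C']).D
                  .transform(lambda x: x.expanding(min_periods=2)
                                        .min()
                                        .bfill()
                                        .cumsum())
                  .astype('int'))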

assert df.actual.equals(df.Desired), 'different results, try again'
df

Output

      A    B    C      D  Desired  actual
0   100  AAA    1  False        0       0
1   100  AAA    1  False        0       0
2   200  BBB   55   True        1       1
3   200  BBB   55   True        2       2
4   200  BBB   55   True        3       3
5   200  BBB   55  False        3       3
6   200  BBB   55   True        3       3
7   300  CCC   99  False        0       0
8   300  CCC   99   True        0       0
9   400  DDD  199   True        0       0
10  400  DDD  199  False        0       0
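The individual steps of the expanding/bfill chain can be traced on one group. This is a rough sketch using the 200/BBB/055 rows. The astype(float) conversion is an assumption added here so that the window operation runs on a numeric dtype; the step names are illustrative.

```python
import pandas as pd

# D values of the 200/BBB/055 group, converted to floats (assumption, see above).
d = pd.Series([True, True, True, False, True]).astype(float)

# With min_periods=2 the first row has too few observations, so it is NaN.
step1 = d.expanding(min_periods=2).min()

# bfill copies the next valid value back into that leading NaN.
step2 = step1.bfill()

# The cumulative sum counts only rows where the expanding minimum is still 1.
step3 = step2.cumsum().astype(int)

print(step1.tolist())  # [nan, 1.0, 1.0, 0.0, 0.0]
print(step3.tolist())  # [1, 2, 3, 3, 3]
```

Note that a group with only one row would stay NaN after bfill, so the final astype(int) would fail for such a group.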

Preparing the sample DataFrame

import pandas as pd
import io

t = '''
A,B,C,D,Desired
100,AAA,1,False,0
100,AAA,1,False,0
200,BBB,55,True,1
200,BBB,55,True,2
200,BBB,55,True,3
200,BBB,55,False,3
200,BBB,55,True,3
300,CCC,99,False,0
300,CCC,99,True,0
400,DDD,199,True,0
400,DDD,199,False,0
'''

df = pd.read_csv(io.StringIO(t))
df

Output

      A    B    C      D  Desired
0   100  AAA    1  False        0
1   100  AAA    1  False        0
2   200  BBB   55   True        1
3   200  BBB   55   True        2
4   200  BBB   55   True        3
5   200  BBB   55  False        3
6   200  BBB   55   True        3
7   300  CCC   99  False        0
8   300  CCC   99   True        0
9   400  DDD  199   True        0
10  400  DDD  199  False        0

Getting the maximum per group

df.groupby(['A','B','C']).actual.max().reset_index()

Output

     A    B    C  actual
0  100  AAA    1       0
1  200  BBB   55       3
2  300  CCC   99       0
3  400  DDD  199       0
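The two answers appear to read "consecutive" differently, which only shows up on the added 400/DDD/199 group (a single True followed by False). The comparison below is a sketch under that reading. The astype(float) call is an assumption added here, and the names answer1 and answer2 are illustrative.

```python
import pandas as pd

# D values of the 400/DDD/199 group: one True, then False.
d = pd.Series([True, False])

# Answer 1's approach: the leading True is counted even though it stands alone.
answer1 = d.cummin().cumsum()

# Answer 2's approach: a lone True is not treated as a consecutive run.
answer2 = (
    d.astype(float)
     .expanding(min_periods=2).min()
     .bfill()
     .cumsum()
     .astype(int)
)

print(answer1.tolist())  # [1, 1]
print(answer2.tolist())  # [0, 0]
```

Which result is wanted depends on whether a single leading True should count as a run; the original question's data does not contain such a group.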
