Expanding mean based on boolean column with False as most recent value in Pandas

Question

Suppose I have the following dataframe:

import numpy as np
import pandas as pd

b = {'user': [1, 1, 1, 1, 2, 2, 2],
     'value': [10, 20, 30, 40, 1, 2, 3],
     'loan': [True, True, True, False, True, False, True]}
temp_df: pd.DataFrame = pd.DataFrame(b)
temp_df['date'] = np.array([23, 24, 25, 26, 27, 28, 29])

   user  value   loan  date
0     1     10   True    23
1     1     20   True    24
2     1     30   True    25
3     1     40  False    26
4     2      1   True    27
5     2      2  False    28
6     2      3   True    29

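I want to compute, in a new column, a "rolling" (expanding) mean per user that only takes values into account where loan == True. It should be the mean up to, but not including, the current row. The desired output looks like this: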

   user  value   loan  date  cummean_value
0     1     10   True    23        0
1     1     20   True    24        10
2     1     30   True    25        15
3     1     40  False    26        20
4     2      1   True    27        0
5     2      2  False    28        1
6     2      3   True    29        1

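When loan == False, I want the value to be the most recent mean computed so far (over the rows where loan is True). The first value for each user is essentially NaN and should be replaced with 0, as in the desired output.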
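
To pin down the requirement, here is a minimal reference sketch that computes the column with a plain Python loop. The function name `expanding_mean_before` is hypothetical and only used for illustration; a loop like this is slow on large frames, but it is handy for checking a vectorized solution against.

```python
import pandas as pd

def expanding_mean_before(df):
    """Reference loop: per user, mean of earlier rows where loan is True."""
    totals = {}  # user -> running sum of values where loan is True
    counts = {}  # user -> number of loan == True rows seen so far
    out = []
    for user, value, loan in zip(df["user"], df["value"], df["loan"]):
        n = counts.get(user, 0)
        # Mean of the rows seen *before* this one; 0 when there are none yet.
        out.append(totals.get(user, 0) / n if n else 0.0)
        if loan:
            totals[user] = totals.get(user, 0) + value
            counts[user] = n + 1
    return pd.Series(out, index=df.index, name="cummean_value")

temp_df = pd.DataFrame({
    "user": [1, 1, 1, 1, 2, 2, 2],
    "value": [10, 20, 30, 40, 1, 2, 3],
    "loan": [True, True, True, False, True, False, True],
})
expected = expanding_mean_before(temp_df)
print(expected.tolist())  # [0.0, 10.0, 15.0, 20.0, 0.0, 1.0, 1.0]
```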

Answer 1

Let's try groupby + cumsum:

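# Mask out rows where loan is False, then per user: shift by one row so the
# current row is excluded, and divide the running sum by the running count.
# group_keys=False keeps the original index so the result can be assigned
# back (recent pandas versions otherwise prepend the group key to the index).
temp_df['new'] = temp_df['value'].where(temp_df['loan']).groupby(temp_df['user'], group_keys=False)\
      .apply(lambda x : (x.shift().cumsum()/x.shift().notna().cumsum()).ffill().fillna(0))
temp_df['new']
Out[54]: 
0     0.0
1    10.0
2    15.0
3    20.0
4     0.0
5     1.0
6     1.0
Name: new, dtype: float64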
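
The one-liner above packs several steps together. As a rough sketch, the same idea can be spelled out step by step without `apply`. Note this is a slight variant rather than the exact code above: it fills the masked values with 0 before the cumulative sum, so no `ffill` is needed. The intermediate names (`masked`, `prev`, `running_sum`, `running_cnt`) are made up for illustration.

```python
import pandas as pd

temp_df = pd.DataFrame({
    "user": [1, 1, 1, 1, 2, 2, 2],
    "value": [10, 20, 30, 40, 1, 2, 3],
    "loan": [True, True, True, False, True, False, True],
})

# Keep value only where loan is True; other rows become NaN.
masked = temp_df["value"].where(temp_df["loan"])
# Shift within each user so the current row is excluded.
prev = masked.groupby(temp_df["user"]).shift()
# Running sum and running count of the earlier loan == True values.
running_sum = prev.fillna(0).groupby(temp_df["user"]).cumsum()
running_cnt = prev.notna().astype(int).groupby(temp_df["user"]).cumsum()
# 0 / 0 yields NaN for rows with no earlier True value; replace it with 0.
result = (running_sum / running_cnt).fillna(0)
print(result.tolist())  # [0.0, 10.0, 15.0, 20.0, 0.0, 1.0, 1.0]
```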

Answer 2

Try:

# supplementary columns:
temp_df['value2'] = np.where(temp_df['loan'], temp_df['value'], 0)
temp_df['x'] = np.where(temp_df['loan'], 1, 0)

# cumulative mean per user up to and including the current row
# (sum of loan == True values divided by the count of loan == True rows)
temp_df['cummean_value'] = temp_df.groupby('user')['value2'].cumsum() \
    .div(temp_df.groupby('user')['x'].cumsum())

# exclude the current row: shift down by one within each user,
# then fill the leading NaN of each group with 0
temp_df['cummean_value'] = temp_df.groupby('user')['cummean_value'].shift().fillna(0)

# clean-up
temp_df.drop(['x', 'value2'], axis=1, inplace=True)

Output:

   user  value   loan  date  cummean_value
0     1     10   True    23            0.0
1     1     20   True    24           10.0
2     1     30   True    25           15.0
3     1     40  False    26           20.0
4     2      1   True    27            0.0
5     2      2  False    28            1.0
6     2      3   True    29            1.0
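
If the helper columns are unwanted, the same sum-over-count idea can be wrapped in a small function that leaves the frame untouched. This is only a sketch: `cummean_before` and its parameter names are hypothetical, and the second frame below is invented test data for the edge case where a user's first row has loan == False.

```python
import pandas as pd

def cummean_before(df, value="value", flag="loan", key="user"):
    """Mean of earlier rows with flag == True, per key; 0 when none exist."""
    vals = df[value].where(df[flag], 0)  # values on False rows count as 0
    cnt = df[flag].astype(int)           # 1 where the flag is True, else 0
    # Cumulative mean including the current row (NaN while the count is 0).
    mean_incl = vals.groupby(df[key]).cumsum() / cnt.groupby(df[key]).cumsum()
    # Shift within each group to exclude the current row, then fill NaN with 0.
    return mean_incl.groupby(df[key]).shift().fillna(0)

temp_df = pd.DataFrame({
    "user": [1, 1, 1, 1, 2, 2, 2],
    "value": [10, 20, 30, 40, 1, 2, 3],
    "loan": [True, True, True, False, True, False, True],
})
main_result = cummean_before(temp_df)

# Edge case: the first row of the user has loan == False.
edge_df = pd.DataFrame({
    "user": [1, 1, 1],
    "value": [5, 7, 9],
    "loan": [False, True, True],
})
edge_result = cummean_before(edge_df)
print(main_result.tolist())  # [0.0, 10.0, 15.0, 20.0, 0.0, 1.0, 1.0]
print(edge_result.tolist())  # [0.0, 0.0, 7.0]
```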
