pandas: fillna in a column with the cumsum of previous rows (reset after every NaN)
I found a solution that handles this row by row, but is there a fast way to do it column-wise?
Here is a quick example of the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([['GB',43.76],
['TEN',17.3],
['ARI',0.2],
['ATL',12.3],
['HOU',21.1],
['ARI',1.7],
['ATL',12.6],
['SF',15.0],
['GB',5.7],
[1.0,np.nan],
['GB',43.76],
['TEN',17.3],
['ARI',0.2],
['ATL',12.3],
['HOU',21.1],
['ARI',1.7],
['ATL',12.6],
['BUF',7.0],
['GB',5.7],
[2.0,np.nan]], columns = ['team','points'])
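I have been trying to work with df['sum'] = df['points'].cumsum(). It obviously computes the cumulative sum, but what I need is for it to restart when/if it hits a NaN, rather than skipping over it.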
Use GroupBy.cumsum with a helper Series created by checking for missing values and taking another cumsum of that check:
df['sum'] = df.groupby(df['points'].isna().cumsum())['points'].cumsum()
print (df)
team points sum
0 GB 43.76 43.76
1 TEN 17.30 61.06
2 ARI 0.20 61.26
3 ATL 12.30 73.56
4 HOU 21.10 94.66
5 ARI 1.70 96.36
6 ATL 12.60 108.96
7 SF 15.00 123.96
8 GB 5.70 129.66
9 1 NaN NaN
10 GB 43.76 43.76
11 TEN 17.30 61.06
12 ARI 0.20 61.26
13 ATL 12.30 73.56
14 HOU 21.10 94.66
15 ARI 1.70 96.36
16 ATL 12.60 108.96
17 BUF 7.00 115.96
18 GB 5.70 121.66
19 2 NaN NaN
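To see why the helper Series works, it can help to look at it on its own. The following is a minimal sketch on a small made-up Series (the values and the names `group_id` and `running` are illustrative, not taken from the question): every NaN increments the counter, so all rows between two NaNs end up with the same label, and the cumulative sum is then taken within each label.

```python
import numpy as np
import pandas as pd

# Small hypothetical Series, not the data from the question.
points = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan])

# isna() is True on NaN rows; its cumsum increases by one at every NaN,
# so each NaN row starts a new group label.
group_id = points.isna().cumsum()

# The cumulative sum is computed separately inside each group.
# NaN rows stay NaN because cumsum skips missing values.
running = points.groupby(group_id).cumsum()

print(group_id.tolist())  # [0, 0, 1, 1, 1, 2]
print(running.tolist())   # [1.0, 3.0, nan, 3.0, 7.0, nan]
```

Each NaN row belongs to the group it opens, which is why the running total starts again on the row right after it.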
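I'm not sure whether this is the same as jezrael's solution, but I would suggest creating a column that labels the summation groups, as in […], except that you check for np.nan instead of 0. Then take the cumulative sum within each of those groups.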
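One way this suggestion might look, as a sketch only: the group labels are stored in an explicit column before grouping. The column name `grp` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the 'points' column matters here.
df = pd.DataFrame({"points": [5.0, 1.5, np.nan, 2.0, 2.5]})

# Explicit group column: the label changes at every NaN row.
df["grp"] = df["points"].isna().cumsum()

# Cumulative sum within each labelled group.
df["sum"] = df.groupby("grp")["points"].cumsum()

print(df)
```

Keeping `grp` as a real column makes the grouping easy to inspect; it can be dropped afterwards with `df.drop(columns="grp")` if it is not needed.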
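Another approach that does not use groupby, and that assumes all points are positive: take the cumsum of points, ffill the NaN rows with the previous value, then subtract the cummax of the values located at the isna positions, for example: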
df['s'] = df['points'].cumsum().ffill()
df['s'] -= (df['s']*df['points'].isna()).cummax()
print (df)
team points s
0 GB 43.76 43.76
1 TEN 17.30 61.06
2 ARI 0.20 61.26
3 ATL 12.30 73.56
4 HOU 21.10 94.66
5 ARI 1.70 96.36
6 ATL 12.60 108.96
7 SF 15.00 123.96
8 GB 5.70 129.66
9 1 NaN 0.00
10 GB 43.76 43.76
11 TEN 17.30 61.06
12 ARI 0.20 61.26
13 ATL 12.30 73.56
14 HOU 21.10 94.66
15 ARI 1.70 96.36
16 ATL 12.60 108.96
17 BUF 7.00 115.96
18 GB 5.70 121.66
19 2 NaN 0.00
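The positivity assumption matters because cummax only picks out the total at the most recent NaN when the running total never decreases. The sketch below wraps the two lines above in a hypothetical helper, `reset_cumsum_cummax` (a name introduced here for illustration), and runs it on two small made-up Series, one positive and one negative.

```python
import numpy as np
import pandas as pd

def reset_cumsum_cummax(points: pd.Series) -> pd.Series:
    """Hypothetical helper wrapping the two-line cummax approach."""
    # Running total over the whole column, carried forward across NaN rows.
    s = points.cumsum().ffill()
    # Keep the total only on NaN rows (0 elsewhere), then take the running max.
    # This equals the total at the latest NaN only if totals never decrease.
    offset = (s * points.isna()).cummax()
    return s - offset

positive = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan, 5.0])
print(reset_cumsum_cummax(positive).tolist())
# [1.0, 3.0, 0.0, 3.0, 7.0, 0.0, 5.0]

negative = pd.Series([-1.0, -2.0, np.nan, -3.0])
print(reset_cumsum_cummax(negative).tolist())
# [-1.0, -3.0, -3.0, -6.0]  (no reset: the offset never moves below zero)
```

With negative values the total never restarts, so the groupby version is the safer choice whenever the sign of the data is not guaranteed. Note also that this approach writes 0.0 on the NaN rows, whereas the groupby version leaves them as NaN.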