如何在跳过前两行的同时计算列的累计和？

Question

我有一个 pandas 数据框，看起来像：

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         4.638107      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
CES - Storage          Hydro        4.277016       0.000000         37.916419
                       Solar        9.774715       0.000000         72.916419
                       Wind        11.881869       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         2.101907      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
Reference - No Storage Hydro        4.277016       0.000000         35.380219
                       Solar       14.289311       0.000000         70.380219
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         18.566327
                       NGCC         4.533380      25.109150         29.001897
                       NGGT        17.224408      32.703513         30.145397
                       Overflow    35.000000     169.679554         34.678777
Reference - Storage    Hydro        4.277016       0.000000         51.903185
                       Solar       14.894274       0.000000         86.903185
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         19.171290
                       NGCC         4.533380      25.109150         29.606860
                       NGGT        14.524706      32.703513         30.750360
                       Overflow    35.000000     169.679554         35.283740

我使用以下方法创建了 chained_capacity 变量：

stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)

但这不是我想要的结果。如您所见，它仍然从列中的第一个初始值开始求和。我希望总和从第三个值开始。所以预期的输出是：

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000         11.881870
                       Nuclear      5.242805      12.689066         17.124674
                       NGCC         2.101907      25.109150         17.12 + 2.10
                       NGGT         4.638107      32.703513         ...
                       Overflow    35.000000     169.679554         ...
...

这里是df.to_dict()能够完整重现数据：

{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, ('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}

Answer 1

使用中的技巧，

# this transform may be slow for large dataframes
stack['chained_capacity'] = \
    stack.groupby('Case')['capacity_gw'].transform(lambda x: x.cumsum().shift(2))

或

# creates a temporary column; should be fast/scalable for large df
stack['temp'] = stack.groupby('Case')['capacity_gw'].cumsum()
stack['chained_capacity'] = stack.groupby('Case')['temp'].shift(2)
stack = stack.drop(columns=['temp'])

如何在跳过前两行的同时计算列的累计和？

How do I compute the cumlative sum of a column while skipping the first two rows?

python

cumulative-sum

pandas