如何在跳过前两行的同时计算列的累计和?
How do I compute the cumlative sum of a column while skipping the first two rows?
我有一个 pandas 数据框,看起来像:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 4.638107 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
CES - Storage Hydro 4.277016 0.000000 37.916419
Solar 9.774715 0.000000 72.916419
Wind 11.881869 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 2.101907 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
Reference - No Storage Hydro 4.277016 0.000000 35.380219
Solar 14.289311 0.000000 70.380219
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 18.566327
NGCC 4.533380 25.109150 29.001897
NGGT 17.224408 32.703513 30.145397
Overflow 35.000000 169.679554 34.678777
Reference - Storage Hydro 4.277016 0.000000 51.903185
Solar 14.894274 0.000000 86.903185
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 19.171290
NGCC 4.533380 25.109150 29.606860
NGGT 14.524706 32.703513 30.750360
Overflow 35.000000 169.679554 35.283740
我使用以下方法创建了 chained_capacity
变量:
stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)
但这不是我想要的结果。如您所见,它仍然从列中的第一个初始值开始求和。我希望总和从第三个值开始。所以预期的输出是:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 11.881870
Nuclear 5.242805 12.689066 17.124674
NGCC 2.101907 25.109150 17.12 + 2.10
NGGT 4.638107 32.703513 ...
Overflow 35.000000 169.679554 ...
...
这里是df.to_dict()
能够完整重现数据:
{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, ('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}
使用中的技巧,
# this transform may be slow for large dataframes
stack['chained_capacity'] = \
stack.groupby('Case')['capacity_gw'].transform(lambda x: x.cumsum().shift(2))
或
# creates a temporary column; should be fast/scalable for large df
stack['temp'] = stack.groupby('Case')['capacity_gw'].cumsum()
stack['chained_capacity'] = stack.groupby('Case')['temp'].shift(2)
stack = stack.drop(columns=['temp'])
我有一个 pandas 数据框,看起来像:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 4.638107 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
CES - Storage Hydro 4.277016 0.000000 37.916419
Solar 9.774715 0.000000 72.916419
Wind 11.881869 0.000000 4.277016
Nuclear 5.242805 12.689066 14.051731
NGCC 2.101907 25.109150 25.933600
NGGT 2.101907 32.703513 31.176405
Overflow 35.000000 169.679554 33.278312
Reference - No Storage Hydro 4.277016 0.000000 35.380219
Solar 14.289311 0.000000 70.380219
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 18.566327
NGCC 4.533380 25.109150 29.001897
NGGT 17.224408 32.703513 30.145397
Overflow 35.000000 169.679554 34.678777
Reference - Storage Hydro 4.277016 0.000000 51.903185
Solar 14.894274 0.000000 86.903185
Wind 10.435570 0.000000 4.277016
Nuclear 1.143500 12.689066 19.171290
NGCC 4.533380 25.109150 29.606860
NGGT 14.524706 32.703513 30.750360
Overflow 35.000000 169.679554 35.283740
我使用以下方法创建了 chained_capacity
变量:
stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)
但这不是我想要的结果。如您所见,它仍然从列中的第一个初始值开始求和。我希望总和从第三个值开始。所以预期的输出是:
capacity_gw marginal_cost chained_capacity
Case Category
CES - No Storage Hydro 4.277016 0.000000 NaN
Solar 9.774715 0.000000 NaN
Wind 11.881870 0.000000 11.881870
Nuclear 5.242805 12.689066 17.124674
NGCC 2.101907 25.109150 17.12 + 2.10
NGGT 4.638107 32.703513 ...
Overflow 35.000000 169.679554 ...
...
这里是df.to_dict()
能够完整重现数据:
{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, ('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}
使用
# this transform may be slow for large dataframes
stack['chained_capacity'] = \
stack.groupby('Case')['capacity_gw'].transform(lambda x: x.cumsum().shift(2))
或
# creates a temporary column; should be fast/scalable for large df
stack['temp'] = stack.groupby('Case')['capacity_gw'].cumsum()
stack['chained_capacity'] = stack.groupby('Case')['temp'].shift(2)
stack = stack.drop(columns=['temp'])