Cumulative sum applied to only one column in Python (pandas)
I want to apply cumsum to one specific column only, because the other columns hold values that must stay unchanged.
This is my current script:
df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
However, this script accumulates every column in my pandas df. The only column that should be cumulatively summed is data.
As requested, here is some sample data:
import pandas as pd
df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
"880022443344556677782", "880022443344556677787", "880022443344556677782",
"880022443344556677781"],
'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
'Usage': [20, 40, 100, 50, 30, 30, 2000],
'Sec': [10, 15, 20, 1, 5, 6, 30]})
ID Month Sec Usage
0 880022443344556677787 201701 10 20
1 880022443344556677782 201701 15 40
2 880022443344556677787 201702 20 100
3 880022443344556677782 201702 1 50
4 880022443344556677787 201703 5 30
5 880022443344556677782 201703 6 30
6 880022443344556677781 201703 30 2000
Desired output:
ID Month Sec Usage
0 880022443344556677787 201701 10 20
1 880022443344556677782 201701 15 40
2 880022443344556677787 201702 20 120
3 880022443344556677782 201702 1 90
4 880022443344556677787 201703 5 150
5 880022443344556677782 201703 6 120
6 880022443344556677781 201703 30 2000
I think you need set_index for the columns that should not be cumulatively summed. I find them dynamically with a list comprehension:
cumsum_col = 'Usage'
df1 = df.groupby(by=['ID','Month'], sort=False).sum()
cols = [col for col in df1.columns if col != cumsum_col]
df1 = df1.set_index(cols, append=True).groupby(level=[0]).cumsum().reset_index()
print (df1)
ID Month Sec Usage
0 880022443344556677787 201701 10 20
1 880022443344556677782 201701 15 40
2 880022443344556677787 201702 20 120
3 880022443344556677782 201702 1 90
4 880022443344556677787 201703 5 150
5 880022443344556677782 201703 6 120
6 880022443344556677781 201703 30 2000
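For the sample data as posted, where every (ID, Month) pair occurs only once, a shorter route also appears to give the desired output: select the single column after the groupby and write the result back. This is a minimal sketch, not the answerer's method. It assumes the rows are already in chronological order within each ID, and the name `out` is my own.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
    'Usage': [20, 40, 100, 50, 30, 30, 2000],
    'Sec': [10, 15, 20, 1, 5, 6, 30]})

# Work on a copy so the original frame stays available for comparison
out = df.copy()

# Running total of Usage per ID; every other column is left untouched
out['Usage'] = out.groupby('ID')['Usage'].cumsum()
print(out)
```

If the rows are not ordered by Month, sorting with sort_values(['ID', 'Month']) first would be needed for the running total to follow the calendar.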
Edit: to keep the original Usage column and add the running total as a new column:
cumsum_col = 'Usage'
df2 = df.groupby(by=['ID','Month'], sort=False).sum()
cols = [col for col in df2.columns if col != cumsum_col]
df1 = df2.set_index(cols, append=True).groupby(level=[0]).cumsum()
df1 = df2.assign(Usage_cumsum = df1.reset_index(level=2, drop=True)).reset_index()
print (df1)
ID Month Sec Usage Usage_cumsum
0 880022443344556677787 201701 10 20 20
1 880022443344556677782 201701 15 40 40
2 880022443344556677787 201702 20 100 120
3 880022443344556677782 201702 1 50 90
4 880022443344556677787 201703 5 30 150
5 880022443344556677782 201703 6 30 120
6 880022443344556677781 201703 30 2000 2000
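Edit 1:
Your sample data contains nothing for sum to aggregate, because every (ID, Month) pair occurs only once, so I modified the data slightly (the solution is similar to the one above, but not identical):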
df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
"880022443344556677782", "880022443344556677787", "880022443344556677782",
"880022443344556677781"],
'Month': ["201701", "201701", "201701", "201702", "201703", "201701", "201703"],
'Usage': [20, 40, 100, 50, 30, 30, 2000],
'Sec': [10, 15, 20, 1, 5, 6, 30]})
print (df)
ID Month Sec Usage
0 880022443344556677787 201701 10 20
1 880022443344556677782 201701 15 40
2 880022443344556677787 201701 20 100
3 880022443344556677782 201702 1 50
4 880022443344556677787 201703 5 30
5 880022443344556677782 201701 6 30
6 880022443344556677781 201703 30 2000
#aggregate sum to all columns
df1 = df.groupby(['ID', 'Month']).sum()
print (df1)
                              Sec  Usage
ID                    Month
880022443344556677781 201703   30   2000
880022443344556677782 201701   21     70
                      201702    1     50
880022443344556677787 201701   30    120
                      201703    5     30
#cumulative sum of the Usage column only, within each ID
s = df1.groupby(level=0)['Usage'].cumsum()
print (s)
ID                     Month
880022443344556677781  201703    2000
880022443344556677782  201701      70
                       201702     120
880022443344556677787  201701     120
                       201703     150
Name: Usage, dtype: int64
#join cumsum series to aggregate df1
df3 = df1.join(s, rsuffix='_cumsum').reset_index()
print (df3)
ID Month Sec Usage Usage_cumsum
0 880022443344556677781 201703 30 2000 2000
1 880022443344556677782 201701 21 70 70
2 880022443344556677782 201702 1 50 120
3 880022443344556677787 201701 30 120 120
4 880022443344556677787 201703 5 30 150
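The three steps above (sum per group, cumulative sum of one column, attach the result) can be folded into a small helper. This is only a sketch: the function name `sum_then_cumsum` and its parameters are hypothetical, and it assumes all non-key columns are numeric.

```python
import pandas as pd

def sum_then_cumsum(frame, keys, col):
    """Sum all columns per `keys`, then add a running total of `col`
    that restarts for each value of the first key."""
    agg = frame.groupby(keys).sum()
    # level=0 is the first key, so the running total is kept per ID here
    agg[col + '_cumsum'] = agg.groupby(level=0)[col].cumsum()
    return agg.reset_index()

df = pd.DataFrame({
    'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    'Month': ["201701", "201701", "201701", "201702", "201703", "201701", "201703"],
    'Usage': [20, 40, 100, 50, 30, 30, 2000],
    'Sec': [10, 15, 20, 1, 5, 6, 30]})

df3 = sum_then_cumsum(df, ['ID', 'Month'], 'Usage')
print(df3)
```

Assigning the Series back by index does the same job as the join with rsuffix, since both align on the (ID, Month) index.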
Consider the dataframe df:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
name=list('aaaaaaaabbbbbbbb'),
day=np.tile(np.arange(2).repeat(4), 2),
data=np.arange(16)
))
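First, you can perform cumsum on one specific column by naming that column after the groupby call.
Second, you can use join to add the result back to the dataframe df: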
d2 = df.groupby(['name', 'day']).data.sum().groupby(level=0).cumsum()
df.join(d2, on=['name', 'day'], rsuffix='_cum')
data day name data_cum
0 0 0 a 6
1 1 0 a 6
2 2 0 a 6
3 3 0 a 6
4 4 1 a 28
5 5 1 a 28
6 6 1 a 28
7 7 1 a 28
8 8 0 b 38
9 9 0 b 38
10 10 0 b 38
11 11 0 b 38
12 12 1 b 92
13 13 1 b 92
14 14 1 b 92
15 15 1 b 92
You can already pass the cumulative sum ('cumsum') as an aggregation to df.groupby. Supply 'cumsum' as a string as the aggregation function for the 'data' column:
df.groupby(['name','day']).agg({'data': 'cumsum'})
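One thing to watch for: grouping by both name and day and then taking cumsum accumulates row by row within each (name, day) pair, which is not the same as the running total of daily sums shown in the previous answer. The sketch below contrasts the two on the same dataframe. It calls cumsum directly on the selected column rather than through agg, and the names `within_group` and `across_days` are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    name=list('aaaaaaaabbbbbbbb'),
    day=np.tile(np.arange(2).repeat(4), 2),
    data=np.arange(16)
))

# Row-level running total that restarts for every (name, day) pair
within_group = df.groupby(['name', 'day'])['data'].cumsum()

# Daily sums first, then a running total over the days within each name
across_days = df.groupby(['name', 'day'])['data'].sum().groupby(level=0).cumsum()

print(within_group.tolist())
print(across_days.tolist())
```

Which of the two is wanted depends on whether the total should grow per row or per (name, day) group.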