Analytic Sliding Windows function in Python Pandas
I have a table:
import pandas as pd

# One row per account action: date, customer id, customer name, amount
list_1 = [['2016-01-01', 1, 'King', 1000],
          ['2016-01-02', 1, 'King', -200],
          ['2016-01-03', 1, 'King', 100],
          ['2016-01-04', 1, 'King', -400],
          ['2016-01-05', 1, 'King', 200],
          ['2016-01-06', 1, 'King', -200],
          ['2016-01-01', 2, 'Smith', 1000],
          ['2016-01-02', 2, 'Smith', -300],
          ['2016-01-03', 2, 'Smith', -600],
          ['2016-01-04', 2, 'Smith', 100],
          ['2016-01-05', 2, 'Smith', -100]]
labels = ['a_date', 'c_id', 'c_name', 'c_action']
df = pd.DataFrame(list_1, columns=labels)
df
Output:
a_date c_id c_name c_action
0 2016-01-01 1 King 1000
1 2016-01-02 1 King -200
2 2016-01-03 1 King 100
3 2016-01-04 1 King -400
4 2016-01-05 1 King 200
5 2016-01-06 1 King -200
6 2016-01-01 2 Smith 1000
7 2016-01-02 2 Smith -300
8 2016-01-03 2 Smith -600
9 2016-01-04 2 Smith 100
10 2016-01-05 2 Smith -100
I need to get this table:
a_date c_id c_name c_amount Balance
2016-01-01 1 King 1000 1000
2016-01-02 1 King -200 800
2016-01-03 1 King 100 900
2016-01-04 1 King -400 500
2016-01-05 1 King 200 700
2016-01-06 1 King -200 500
2016-01-01 2 Smith 1000 1000
2016-01-02 2 Smith -300 700
2016-01-03 2 Smith -600 100
2016-01-04 2 Smith 100 200
2016-01-05 2 Smith -100 100
So I need to create a "Balance" column that holds each customer's cumulative amount after every action.
This is equivalent to the SQL query:
SELECT *,
SUM(c_amount) OVER (PARTITION BY c_id ORDER BY a_date) AS 'Balance'
FROM account_actions
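A solution for two customers is not hard: I could split the table by c_id, sum each part, and merge them back. But it needs to be a dynamic solution that works for 10000 customers...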
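For illustration, the manual split-and-merge approach described above might look like the following sketch. It uses a shortened copy of the data, and the names `parts` and `manual` are hypothetical. Each filter scans the whole frame, so the cost grows with the number of customers times the number of rows, which is why this does not scale well.

```python
import pandas as pd

# Shortened copy of the sample data, for illustration only.
df = pd.DataFrame(
    [['2016-01-01', 1, 'King', 1000],
     ['2016-01-02', 1, 'King', -200],
     ['2016-01-01', 2, 'Smith', 1000],
     ['2016-01-02', 2, 'Smith', -300]],
    columns=['a_date', 'c_id', 'c_name', 'c_action'])

parts = []
for cid in df['c_id'].unique():
    # Take one customer's rows, order them by date, and accumulate.
    part = df[df['c_id'] == cid].sort_values('a_date').copy()
    part['Balance'] = part['c_action'].cumsum()
    parts.append(part)

# Stitch the per-customer pieces back into one table.
manual = pd.concat(parts)
print(manual)
```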
As @Vaishali commented, this is groupby and cumsum. You may want to call sort_values first to make sure the data is in order, although it looks like it already is:
# sort by `c_id` and `a_date`
df = df.sort_values(['c_id','a_date'])
df['balance'] = df.groupby('c_id')['c_action'].cumsum()
Output:
a_date c_id c_name c_action balance
0 2016-01-01 1 King 1000 1000
1 2016-01-02 1 King -200 800
2 2016-01-03 1 King 100 900
3 2016-01-04 1 King -400 500
4 2016-01-05 1 King 200 700
5 2016-01-06 1 King -200 500
6 2016-01-01 2 Smith 1000 1000
7 2016-01-02 2 Smith -300 700
8 2016-01-03 2 Smith -600 100
9 2016-01-04 2 Smith 100 200
10 2016-01-05 2 Smith -100 100
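The sort matters once the rows arrive out of order. The sketch below is a small check under assumptions: the rows are a shuffled subset of the sample data, and converting `a_date` with `pd.to_datetime` is an addition here so that ordering is chronological rather than string-based (the ISO-formatted strings in the question happen to sort correctly either way).

```python
import pandas as pd

# Shuffled subset of the sample data, for illustration only.
rows = [
    ['2016-01-02', 1, 'King', -200],
    ['2016-01-01', 1, 'King', 1000],
    ['2016-01-01', 2, 'Smith', 1000],
    ['2016-01-03', 1, 'King', 100],
    ['2016-01-02', 2, 'Smith', -300],
]
df = pd.DataFrame(rows, columns=['a_date', 'c_id', 'c_name', 'c_action'])

# Parse the dates so they sort chronologically, not as plain strings.
df['a_date'] = pd.to_datetime(df['a_date'])

# Equivalent of PARTITION BY c_id ORDER BY a_date: sort, then accumulate per group.
df = df.sort_values(['c_id', 'a_date']).reset_index(drop=True)
df['balance'] = df.groupby('c_id')['c_action'].cumsum()
print(df)
```

Because `cumsum` on a groupby returns a Series aligned to the original index, the assignment works without any merge step, whatever the number of customers.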