How to sum values under GroupBy and consecutive date conditions?
Given the table:
ID | LINE | SITE | DATE | UNITS | TOTAL |
---|---|---|---|---|---|
1 | X | AAA | 02-May-2017 | 12 | 30 |
2 | X | AAA | 03-May-2017 | 10 | 22 |
3 | X | AAA | 04-May-2017 | 22 | 40 |
4 | Z | AAA | 20-MAY-2017 | 15 | 44 |
5 | Z | AAA | 21-May-2017 | 8 | 30 |
6 | Z | BBB | 22-May-2017 | 10 | 32 |
7 | Z | BBB | 23-May-2017 | 25 | 52 |
8 | K | CCC | 02-Jun-2017 | 6 | 22 |
9 | K | CCC | 03-Jun-2017 | 4 | 33 |
10 | K | CCC | 12-Aug-2017 | 11 | 44 |
11 | K | CCC | 13-Aug-2017 | 19 | 40 |
12 | K | CCC | 14-Aug-2017 | 30 | 40 |
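For each row whose LINE and SITE match the previous row (the previous day), I need to compute the "Last day" and "Last 3 days" columns as shown below. Note that the dates must be consecutive within each LINE/SITE group (ID is unique per row, so it cannot define a group):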
ID | LINE | SITE | DATE | UNITS | TOTAL | Last day | Last 3 days |
---|---|---|---|---|---|---|---|
1 | X | AAA | 02-May-2017 | 12 | 30 | 0 | 0 |
2 | X | AAA | 03-May-2017 | 10 | 22 | 12/30 | 12/30 |
3 | X | AAA | 04-May-2017 | 22 | 40 | 10/22 | (10+12)/(30+22) |
4 | Z | AAA | 20-MAY-2017 | 15 | 44 | 0 | 0 |
5 | Z | AAA | 21-May-2017 | 8 | 30 | 15/44 | 15/44 |
6 | Z | BBB | 22-May-2017 | 10 | 32 | 0 | 0 |
7 | Z | BBB | 23-May-2017 | 25 | 52 | 10/32 | 10/32 |
8 | K | CCC | 02-Jun-2017 | 6 | 22 | 0 | 0 |
9 | K | CCC | 03-Jun-2017 | 4 | 33 | 6/22 | 6/22 |
10 | K | CCC | 12-Aug-2017 | 11 | 44 | 4/33 | 0 |
11 | K | CCC | 13-Aug-2017 | 19 | 40 | 11/44 | 11/44 |
12 | K | CCC | 14-Aug-2017 | 30 | 40 | 19/40 | (11+19)/(44+40) |
In cases like this, I usually loop over the groups produced by groupby:
```python
import pandas as pd
import numpy as np

# Table copied from the question and saved as CSV
table = pd.read_csv('/home/fm/Desktop/stackover.csv')
table.set_index('ID', inplace=True)
table[['Last day', 'Last 3 days']] = np.nan

for _, r in table.groupby(['LINE', 'SITE']):
    # Mark rows whose date is not exactly one day after the previous row
    limits_interval = pd.to_datetime(r['DATE']).diff() != pd.Timedelta(days=1)
    # The first row is a false positive: it has no previous day to compare with
    limits_interval.iloc[0] = False
    ids_subset = r.index[limits_interval].to_list()
    ids_subset.append(r.index[-1] + 1)  # sentinel so the last run is included
    id_start = 0
    for id_end in ids_subset:
        # One run of consecutive dates (assumes integer, increasing IDs)
        r_sub = r.loc[id_start:id_end - 1, :].copy()
        id_start = id_end
        # Shift by one row so each row sees the previous day's values
        # (safe when there is exactly one row per day, as in the example)
        r_shifted = r_sub.shift(1)
        r_sub['Last day'] = r_shifted['UNITS'] / r_shifted['TOTAL']
        # Sum over at most the 3 previous days rather than the whole run
        aux_units_sum = r_shifted['UNITS'].rolling(3, min_periods=1).sum()
        aux_total_sum = r_shifted['TOTAL'].rolling(3, min_periods=1).sum()
        r_sub['Last 3 days'] = aux_units_sum / aux_total_sum
        r_sub.fillna(0, inplace=True)
        table.loc[r_sub.index, :] = r_sub.copy()
```
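You could also wrap this logic in a function and apply it through groupby, which would be cleaner and more elegant; see "Apply function to pandas groupby".

Hope this helps, good luck.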
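As a rough illustration of the function-based approach, here is a minimal sketch. The function name `add_ratios`, the underscore-prefixed helper columns, and the inline CSV copy of the question's table are all assumptions made for the example, not part of the original answer. It labels each run of consecutive dates inside a LINE/SITE group, then divides the summed UNITS by the summed TOTAL of the previous day and of up to three previous days within that run. Under this reading, row 10 starts a new run (its date does not follow row 9), so both of its columns come out as 0.

```python
import io
import pandas as pd

# Inline copy of the question's table (assumption: comma-separated layout).
CSV = """ID,LINE,SITE,DATE,UNITS,TOTAL
1,X,AAA,02-May-2017,12,30
2,X,AAA,03-May-2017,10,22
3,X,AAA,04-May-2017,22,40
4,Z,AAA,20-MAY-2017,15,44
5,Z,AAA,21-May-2017,8,30
6,Z,BBB,22-May-2017,10,32
7,Z,BBB,23-May-2017,25,52
8,K,CCC,02-Jun-2017,6,22
9,K,CCC,03-Jun-2017,4,33
10,K,CCC,12-Aug-2017,11,44
11,K,CCC,13-Aug-2017,19,40
12,K,CCC,14-Aug-2017,30,40
"""


def add_ratios(df, window=3):
    """Return a copy of df with 'Last day' and 'Last 3 days' columns added."""
    out = df.copy()
    # Normalise month case ("20-MAY-2017") before parsing with a fixed format.
    out["_date"] = pd.to_datetime(out["DATE"].str.title(), format="%d-%b-%Y")
    out = out.sort_values(["LINE", "SITE", "_date"])

    # A new run starts wherever the gap to the previous row of the same
    # LINE/SITE group is not exactly one day (this includes each group's
    # first row, whose gap is NaT).
    gap = out.groupby(["LINE", "SITE"])["_date"].diff()
    out["_run"] = gap.ne(pd.Timedelta(days=1)).cumsum()

    runs = out.groupby("_run")
    # Previous day's values within the same run.
    prev = runs[["UNITS", "TOTAL"]].shift(1)
    out["Last day"] = (prev["UNITS"] / prev["TOTAL"]).fillna(0)

    # Sums over up to `window` previous days of the same run, excluding today.
    def prev_window_sum(s):
        return s.shift(1).rolling(window, min_periods=1).sum()

    units = runs["UNITS"].transform(prev_window_sum)
    total = runs["TOTAL"].transform(prev_window_sum)
    out["Last 3 days"] = (units / total).fillna(0)

    # Drop helper columns and restore the original row order.
    return out.drop(columns=["_date", "_run"]).sort_index()


table = pd.read_csv(io.StringIO(CSV))
result = add_ratios(table)
print(result)
```

Because the run label is computed once for the whole frame, there is no need for nested loops or label-based slicing on ID, and the window size can be changed through the `window` argument.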