Pandas 具有多列的Dataframe Groupby并求和
Pandas Dataframe Groupby with multiple columns and sum it
我有一个数据框 df,我需要根据条件对多列进行分组。
输入
Employee Name
Subjects
Billable
Hours
Date
Anu
Java
Yes
8
01-03-2021
Anu
Python
Yes
9
02-03-2021
Anu
SQL
No
6
03-03-2021
Anu
React
Yes
5
03-03-2021
Anu
.Net
No
8
04-03-2021
Bala
SQL
No
5
01-03-2021
Bala
Python
Yes
4
01-03-2021
Bala
Java
Yes
2
02-03-2021
Bala
.Net
No
8
03-03-2021
Bala
React
Yes
7
04-03-2021
应在输出文件中添加列:利用率、可计费利用率、不可计费利用率
- 利用率列的计算:利用率 = df[小时]/9*100
-> 如果可计费列值为“YES”- 创建列作为可计费利用率
- 列中计费利用率的计算:计费利用率 = df[小时]/9 * 100
-> 如果可计费列值为“NO”- 创建列为不可计费利用率
- 列中非计费利用率的计算:非计费利用率 = df[小时]/9*100
代码
available =9
utilization= (df['Hours']/available*100).round(2)
df = df.assign(Utilization = utilization)
df1 = df.groupby(['Date','Employee Name'])['Utilization'].sum()
df['Billable'] = np.where(df['Billable'] == 'Yes', 'Billable Utilization','Non Billable Utilization')
df2 = (df.groupby(['Date','Employee Name']).agg({'Hours':sum}).div(available).mul(100)).round(2)
预期输出
我的输出
您可以使用 pivot_table()
将 Hours
“透视”到 Billable
和 Non-Billable
列,然后转换为利用率:
# sample data
df = pd.DataFrame({'Employee Name':{0:'Anu',1:'Anu',2:'Anu',3:'Anu',4:'Anu',5:'Bala',6:'Bala',7:'Bala',8:'Bala',9:'Bala'},'Subjects':{0:'Java',1:'Python',2:'SQL',3:'React',4:'.Net',5:'SQL',6:'Python',7:'Java',8:'.Net',9:'React'},'Billable':{0:'Yes',1:'Yes',2:'No',3:'Yes',4:'No',5:'No',6:'Yes',7:'Yes',8:'No',9:'Yes'},'Hours':{0:8,1:9,2:6,3:5,4:8,5:5,6:4,7:2,8:8,9:7},'Date':{0:'01-03-2021',1:'02-03-2021',2:'03-03-2021',3:'03-03-2021',4:'04-03-2021',5:'01-03-2021',6:'01-03-2021',7:'02-03-2021',8:'03-03-2021',9:'04-03-2021'}})
# pivot hours into (non)billable
df = df.pivot_table(index=['Date', 'Employee Name'], columns='Billable').reset_index()
df.columns = ['Date', 'Employee Name', 'Non-Billable', 'Billable']
# compute utilization
df['Non-Billable'] = df['Non-Billable'].div(9).mul(100).round(2)
df['Billable'] = df['Billable'].div(9).mul(100).round(2)
df['Utilization'] = df[['Billable', 'Non-Billable']].fillna(0).sum(axis=1)
# Date Employee Name Non-Billable Billable Utilization
# 0 01-03-2021 Anu NaN 88.89 88.89
# 1 01-03-2021 Bala 55.56 44.44 100.00
# 2 02-03-2021 Anu NaN 100.00 100.00
# 3 02-03-2021 Bala NaN 22.22 22.22
# 4 03-03-2021 Anu 66.67 55.56 122.23
# 5 03-03-2021 Bala 88.89 NaN 88.89
# 6 04-03-2021 Anu 88.89 NaN 88.89
# 7 04-03-2021 Bala NaN 77.78 77.78
我有一个数据框 df,我需要根据条件对多列进行分组。
输入
Employee Name | Subjects | Billable | Hours | Date |
---|---|---|---|---|
Anu | Java | Yes | 8 | 01-03-2021 |
Anu | Python | Yes | 9 | 02-03-2021 |
Anu | SQL | No | 6 | 03-03-2021 |
Anu | React | Yes | 5 | 03-03-2021 |
Anu | .Net | No | 8 | 04-03-2021 |
Bala | SQL | No | 5 | 01-03-2021 |
Bala | Python | Yes | 4 | 01-03-2021 |
Bala | Java | Yes | 2 | 02-03-2021 |
Bala | .Net | No | 8 | 03-03-2021 |
Bala | React | Yes | 7 | 04-03-2021 |
应在输出文件中添加列:利用率、可计费利用率、不可计费利用率
- 利用率列的计算:利用率 = df[小时]/9*100
-> 如果可计费列值为“YES”- 创建列作为可计费利用率
- 列中计费利用率的计算:计费利用率 = df[小时]/9 * 100
-> 如果可计费列值为“NO”- 创建列为不可计费利用率
- 列中非计费利用率的计算:非计费利用率 = df[小时]/9*100
代码
available =9
utilization= (df['Hours']/available*100).round(2)
df = df.assign(Utilization = utilization)
df1 = df.groupby(['Date','Employee Name'])['Utilization'].sum()
df['Billable'] = np.where(df['Billable'] == 'Yes', 'Billable Utilization','Non Billable Utilization')
df2 = (df.groupby(['Date','Employee Name']).agg({'Hours':sum}).div(available).mul(100)).round(2)
预期输出
我的输出
您可以使用 pivot_table()
将 Hours
“透视”到 Billable
和 Non-Billable
列,然后转换为利用率:
# sample data
df = pd.DataFrame({'Employee Name':{0:'Anu',1:'Anu',2:'Anu',3:'Anu',4:'Anu',5:'Bala',6:'Bala',7:'Bala',8:'Bala',9:'Bala'},'Subjects':{0:'Java',1:'Python',2:'SQL',3:'React',4:'.Net',5:'SQL',6:'Python',7:'Java',8:'.Net',9:'React'},'Billable':{0:'Yes',1:'Yes',2:'No',3:'Yes',4:'No',5:'No',6:'Yes',7:'Yes',8:'No',9:'Yes'},'Hours':{0:8,1:9,2:6,3:5,4:8,5:5,6:4,7:2,8:8,9:7},'Date':{0:'01-03-2021',1:'02-03-2021',2:'03-03-2021',3:'03-03-2021',4:'04-03-2021',5:'01-03-2021',6:'01-03-2021',7:'02-03-2021',8:'03-03-2021',9:'04-03-2021'}})
# pivot hours into (non)billable
df = df.pivot_table(index=['Date', 'Employee Name'], columns='Billable').reset_index()
df.columns = ['Date', 'Employee Name', 'Non-Billable', 'Billable']
# compute utilization
df['Non-Billable'] = df['Non-Billable'].div(9).mul(100).round(2)
df['Billable'] = df['Billable'].div(9).mul(100).round(2)
df['Utilization'] = df[['Billable', 'Non-Billable']].fillna(0).sum(axis=1)
# Date Employee Name Non-Billable Billable Utilization
# 0 01-03-2021 Anu NaN 88.89 88.89
# 1 01-03-2021 Bala 55.56 44.44 100.00
# 2 02-03-2021 Anu NaN 100.00 100.00
# 3 02-03-2021 Bala NaN 22.22 22.22
# 4 03-03-2021 Anu 66.67 55.56 122.23
# 5 03-03-2021 Bala 88.89 NaN 88.89
# 6 04-03-2021 Anu 88.89 NaN 88.89
# 7 04-03-2021 Bala NaN 77.78 77.78