Summarising a pandas dataframe in a single row
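I would appreciate some help summarising the dataframe detailed below into a one-row summary, as shown in the desired output further down the page. Many thanks.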
import pandas as pd
employees = {'Name of Employee': ['Mark','Mark','Mark','Mark','Mark','Mark', 'Mark','Mark','Mark','Mark','Mark','Mark','Mark'],
'Department': ['21','21','21','21','21','21', '21','21','21','21','21','21','21'],
'Team': ['2','2','2','2','2','2','2','2','2','2','2','2','2'],
'Log': ['2020-02-19 09:01:17', '2020-02-19 09:54:02', '2020-04-10 11:00:31', '2020-04-11 12:39:08', '2020-04-18 09:45:22', '2020-05-05 09:01:17', '2020-05-23 09:54:02', '2020-07-03 11:00:31', '2020-07-03 12:39:08', '2020-07-04 09:45:22', '2020-07-05 09:01:17', '2020-07-06 09:54:02', '2020-07-06 11:00:31'],
'Call Duration' : ['0.01178', '0.01736','0.01923','0.00911','0.01007','0.01206','0.01256','0.01006','0.01162','0.00733','0.01250','0.01013','0.01308'],
'ITT': ['NO','YES', 'NO', 'Follow up', 'YES','YES', 'NO', 'Follow up','YES','YES', 'NO','YES','YES']
}
df = pd.DataFrame(employees)
Desired output:
Name  Dept  Team  Start       End         Weeks  Total Calls  Ave. Call time  Sold  Rejected  more info
Mark  21    2     2020-02-19  2020-07-06  19.71  13           0.01207         7     4         2
The logic I am looking to apply is as follows (I suspect the syntax below is wrong, but I hope the calculations are still clear):
- Start = min date in df['Log']
- End = max date in df['Log']
- Weeks = (max date in df['Log'] - min date in df['Log'])/7
- Total Calls = df['Log'].count
- Ave. Call time = (df['Call Duration'].sum)/(df['Log'].count)
- Sold = (df['ITT']=='YES').count
- Rejected = (df['ITT']=='NO').count
- More info = (df['ITT']=='Follow up').count
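The list above can be sketched directly in pandas, without any grouping. This is a minimal sketch rather than a definitive implementation: it assumes only the three columns the calculations touch, and the variable names (start, end, weeks and so on) are illustrative. One detail to watch is that .count() on a boolean comparison counts every non-null row, whether True or False, so the matching rows are counted with .sum() instead.

```python
import pandas as pd

# Only the three columns used by the summary logic, with the question's values.
df = pd.DataFrame({
    'Log': ['2020-02-19 09:01:17', '2020-02-19 09:54:02', '2020-04-10 11:00:31',
            '2020-04-11 12:39:08', '2020-04-18 09:45:22', '2020-05-05 09:01:17',
            '2020-05-23 09:54:02', '2020-07-03 11:00:31', '2020-07-03 12:39:08',
            '2020-07-04 09:45:22', '2020-07-05 09:01:17', '2020-07-06 09:54:02',
            '2020-07-06 11:00:31'],
    'Call Duration': ['0.01178', '0.01736', '0.01923', '0.00911', '0.01007',
                      '0.01206', '0.01256', '0.01006', '0.01162', '0.00733',
                      '0.01250', '0.01013', '0.01308'],
    'ITT': ['NO', 'YES', 'NO', 'Follow up', 'YES', 'YES', 'NO',
            'Follow up', 'YES', 'YES', 'NO', 'YES', 'YES'],
})

# Both columns arrive as strings, so convert them before doing arithmetic.
df['Log'] = pd.to_datetime(df['Log'])
df['Call Duration'] = df['Call Duration'].astype(float)

start = df['Log'].min()
end = df['Log'].max()
# Dividing one Timedelta by another gives a plain float number of weeks.
weeks = (end - start) / pd.Timedelta(weeks=1)
total_calls = df['Log'].count()
avg_call_time = df['Call Duration'].sum() / total_calls

# .sum() counts the True values; .count() would return 13 for all three.
sold = (df['ITT'] == 'YES').sum()
rejected = (df['ITT'] == 'NO').sum()
more_info = (df['ITT'] == 'Follow up').sum()

print(start, end, round(weeks, 2), total_calls,
      round(avg_call_time, 5), sold, rejected, more_info)
```

Note that weeks here includes the time of day, so it comes out slightly above the 19.71 in the desired output, which corresponds to the difference between the two calendar dates only.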
You have a syntax error: you forgot to add a comma at the end of each key.
Now you can work with this dataframe.
import pandas as pd
employees = {'Name': ['Mark','Mark','Mark','Mark','Mark','Mark', 'Mark','Mark','Mark','Mark','Mark','Mark','Mark'],
'Department': ['21','21','21','21','21','21', '21','21','21','21','21','21','21'],
'Team': ['2','2','2','2','2','2','2','2','2','2','2','2','2'],
'Log': ['2020-02-19 09:01:17', '2020-02-19 09:54:02', '2020-04-10 11:00:31', '2020-04-11 12:39:08', '2020-04-18 09:45:22', '2020-05-05 09:01:17', '2020-05-23 09:54:02', '2020-07-03 11:00:31', '2020-07-03 12:39:08', '2020-07-04 09:45:22', '2020-07-05 09:01:17', '2020-07-06 09:54:02', '2020-07-06 11:00:31'],
'Call Duration' : ['0.01178', '0.01736','0.01923','0.00911','0.01007','0.01206','0.01256','0.01006','0.01162','0.00733','0.01250','0.01013','0.01308'],
'ITT': ['NO','YES', 'NO', 'Follow up', 'YES','YES', 'NO', 'Follow up','YES','YES', 'NO','YES','YES']
}
df = pd.DataFrame(employees)
print(df)
Output:
Name Department ... Call Duration ITT
Mark 21 ... 0.01178 NO
Mark 21 ... 0.01736 YES
Mark 21 ... 0.01923 NO
Mark 21 ... 0.00911 Follow up
Mark 21 ... 0.01007 YES
Mark 21 ... 0.01206 YES
Mark 21 ... 0.01256 NO
Mark 21 ... 0.01006 Follow up
Mark 21 ... 0.01162 YES
Mark 21 ... 0.00733 YES
Mark 21 ... 0.01250 NO
Mark 21 ... 0.01013 YES
Mark 21 ... 0.01308 YES
[13 rows x 6 columns]
Try using pd.NamedAgg with groupby (the (column, aggfunc) tuples below are shorthand for pd.NamedAgg):
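import numpy as np

# Parse the timestamps and durations, which are stored as strings.
df['Log'] = pd.to_datetime(df['Log'])
df['Call Duration'] = df['Call Duration'].astype(float)

# One output row per (employee, team, department) group.
df.groupby(['Name of Employee', 'Team', 'Department'])\
    .agg(Start = ('Log', 'min'),
         End = ('Log', 'max'),
         # Span between first and last call, expressed in weeks.
         Weeks = ('Log', lambda x: np.ptp(x) / np.timedelta64(7, 'D')),
         Total_Calls = ('Log', 'count'),
         Avg_Call_Time = ('Call Duration', 'mean'),
         # Summing a boolean mask counts the rows that match.
         Sold = ('ITT', lambda x: (x == 'YES').sum()),
         Rejected = ('ITT', lambda x: (x == 'NO').sum()),
         More_info = ('ITT', lambda x: (x == 'Follow up').sum()))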
Output:
                                                Start                  End      Weeks  Total_Calls  Avg_Call_Time  Sold  Rejected  More_info
Name of Employee Team Department
Mark             2    21          2020-02-19 09:01:17  2020-07-06 11:00:31  19.726114           13       0.012068     7         4          2
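The result above keeps the time of day in Start and End and leaves the group keys in the index, so Weeks comes out as 19.726114 rather than the 19.71 in the desired output. A hedged sketch of one way to get closer to the requested layout follows. It assumes the 19.71 figure is meant to be the difference between the two calendar dates divided by 7; the helper column Day and the renamed headers are illustrative choices, not part of the original answer.

```python
import pandas as pd

n = 13
df = pd.DataFrame({
    'Name of Employee': ['Mark'] * n,
    'Department': ['21'] * n,
    'Team': ['2'] * n,
    'Log': ['2020-02-19 09:01:17', '2020-02-19 09:54:02', '2020-04-10 11:00:31',
            '2020-04-11 12:39:08', '2020-04-18 09:45:22', '2020-05-05 09:01:17',
            '2020-05-23 09:54:02', '2020-07-03 11:00:31', '2020-07-03 12:39:08',
            '2020-07-04 09:45:22', '2020-07-05 09:01:17', '2020-07-06 09:54:02',
            '2020-07-06 11:00:31'],
    'Call Duration': ['0.01178', '0.01736', '0.01923', '0.00911', '0.01007',
                      '0.01206', '0.01256', '0.01006', '0.01162', '0.00733',
                      '0.01250', '0.01013', '0.01308'],
    'ITT': ['NO', 'YES', 'NO', 'Follow up', 'YES', 'YES', 'NO',
            'Follow up', 'YES', 'YES', 'NO', 'YES', 'YES'],
})

df['Log'] = pd.to_datetime(df['Log'])
df['Call Duration'] = df['Call Duration'].astype(float)
# Hypothetical helper column: the timestamp truncated to midnight.
df['Day'] = df['Log'].dt.normalize()

summary = (
    df.groupby(['Name of Employee', 'Department', 'Team'])
      .agg(Start=('Day', 'min'),
           End=('Day', 'max'),
           Total_Calls=('Log', 'count'),
           Avg_Call_Time=('Call Duration', 'mean'),
           Sold=('ITT', lambda s: (s == 'YES').sum()),
           Rejected=('ITT', lambda s: (s == 'NO').sum()),
           More_info=('ITT', lambda s: (s == 'Follow up').sum()))
      .reset_index()  # turn the group keys back into ordinary columns
)

# Weeks between the first and last calendar date, rounded for display.
summary['Weeks'] = ((summary['End'] - summary['Start'])
                    / pd.Timedelta(weeks=1)).round(2)
summary['Avg_Call_Time'] = summary['Avg_Call_Time'].round(5)

# Rename and reorder to follow the column order of the desired output.
summary = summary.rename(columns={'Name of Employee': 'Name',
                                  'Department': 'Dept'})
summary = summary[['Name', 'Dept', 'Team', 'Start', 'End', 'Weeks',
                   'Total_Calls', 'Avg_Call_Time', 'Sold', 'Rejected',
                   'More_info']]

print(summary.to_string(index=False))
```

Grouping still does the work here, so the same code produces one row per employee if the dataframe later holds more than one name.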