Calculating the timedelta between In and Out events after midnight
Below is sandbox code; the real dataset is obviously much larger.
The main problem: when the log-out happens after midnight, I can't work out how to compute the timedelta so that the event is attributed to the previous day.
Example: log-in at 22/03/2022 18:00:00, log-out at 23/03/2022 01:00:00.
The result should show a time difference of 7 hours, Log Day - Tuesday, Date - 22/03/2022.
At the moment I'm thinking of creating a new date column shifted back by 4 hours, df['NewDate'] = (pd.to_datetime(df['LogDateTime']) - timedelta(hours=4)).dt.strftime('%d/%m/%Y'),
and later splitting and merging on the new date, but I'm not sure that's the best way forward.
Any ideas or pointers would be much appreciated.
(Thanks in advance)
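The shifted-date idea above can be sketched like this (the 04:00 cutoff is taken from the question and is an assumption; any cutoff time that no single shift can straddle would work):

```python
import pandas as pd
from datetime import timedelta

# Timestamps for one overnight session: login 18:00, logout 01:00 the next day.
ts = pd.to_datetime(pd.Series(["2022-03-22 18:00:00", "2022-03-23 01:00:00"]))

# Shift everything back 4 hours so early-morning events land on the previous day.
work_date = (ts - timedelta(hours=4)).dt.strftime("%d/%m/%Y")
print(work_date.tolist())  # ['22/03/2022', '22/03/2022']
```

Both events now share the same "work date", so grouping or merging on it would pair them correctly.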
import pandas as pd
data = [[515, '2022-03-16 17:01:11', 'In', 'Wed', '16/03/2022'],
[515, '2022-03-17 00:16:36', 'Out', 'Thu', '17/03/2022'],
[515, '2022-03-17 00:16:49', 'Out', 'Thu', '17/03/2022'],
[515, '2022-03-17 16:42:40', 'In', 'Thu', '17/03/2022'],
[515, '2022-03-17 23:48:09', 'Out', 'Thu', '17/03/2022'],
[333, '2022-03-17 17:16:36', 'In', 'Thu', '17/03/2022'],
[333, '2022-03-17 17:16:45', 'In', 'Thu', '17/03/2022'],
[333, '2022-03-17 17:16:51', 'In', 'Thu', '17/03/2022'],
[333, '2022-03-17 22:55:03', 'Out', 'Thu', '17/03/2022']]
df = pd.DataFrame(data, columns=['ID', 'LogDateTime', 'EventType', 'Log Day', 'Date'])
df.LogDateTime = pd.to_datetime(df.LogDateTime)
df.dtypes
df = df.sort_values(by=['LogDateTime', 'ID'])
dfIn = df.loc[(df.EventType == "In")]
dfOut = df.loc[(df.EventType == "Out")]
# dfIn = dfIn.drop_duplicates(subset=['ID', 'Date', 'Log Day'], keep='first') # Cannot be used due to midnight issue...
# dfOut = dfOut.drop_duplicates(subset=['ID', 'Date', 'Log Day'], keep='first') # Cannot be used due to midnight issue...
FinDF = pd.merge(dfIn, dfOut, on=['ID', 'Log Day', 'Date'], how='left')
FinDF.rename(columns={'LogDateTime_x': 'Log In', 'LogDateTime_y': 'Log Out'}, inplace=True)
FinDF['Hours'] = (FinDF['Log Out'] - FinDF['Log In']).dt.total_seconds().div(3600).round(2)
FinDF.to_excel(r'C:\Test\test.xlsx', index=False)
As the example above shows, the data contains many duplicate events. I tried taking the maximum for the Out events and the opposite for the In events, but because of the midnight issue I was losing data.
The final result should look like this:
Before merging the DataFrames, you have to keep only the first "In" and the last "Out" of each consecutive run. Then you can proceed with your existing logic of merging the two frames and computing the time difference.
Try:
df = df.sort_values(by=['ID','LogDateTime'])
# start a new group whenever the ID changes, or the EventType flips within the same ID
groups = ((df["ID"].eq(df["ID"].shift()) & df["EventType"].ne(df["EventType"].shift())) | df["ID"].ne(df["ID"].shift())).cumsum()
#minimum timestamp to be used for EventType = "In"
mins = df.groupby(groups)["LogDateTime"].transform("min")
#maximum timestamp to be used for EventType = "Out"
maxs = df.groupby(groups)["LogDateTime"].transform("max")
condensed = df[df["LogDateTime"].eq(mins.where(df["EventType"].eq("In"), maxs))].reset_index(drop=True)
condensed["Group"] = condensed.index//2
ins = condensed[condensed["EventType"].eq("In")]
outs = condensed.loc[condensed["EventType"].eq("Out"), ["LogDateTime", "Group"]]
FinDF = ins.merge(outs, on="Group").drop(["EventType", "Group"], axis=1)
FinDF = FinDF.rename(columns={'LogDateTime_x': 'Log In', 'LogDateTime_y': 'Log Out'})
FinDF['Hours'] = FinDF["Log Out"].sub(FinDF["Log In"]).dt.total_seconds().div(3600).round(2)
>>> FinDF
ID Log In Log Day Date Log Out Hours
0 333 2022-03-17 17:16:36 Thu 17/03/2022 2022-03-17 22:55:03 5.64
1 515 2022-03-16 17:01:11 Wed 16/03/2022 2022-03-17 00:16:49 7.26
2 515 2022-03-17 16:42:40 Thu 17/03/2022 2022-03-17 23:48:09 7.09
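The key step here is the run-length grouping expression; isolated on a toy log (IDs and events made up for illustration, using a slightly simplified but equivalent form), it behaves like this:

```python
import pandas as pd

# Toy log: ID 1 has duplicated In and Out events, ID 2 starts a fresh run.
ids = pd.Series([1, 1, 1, 1, 2])
events = pd.Series(["In", "In", "Out", "Out", "In"])

# A new group starts whenever the ID changes or the event type flips;
# cumsum over that boolean mask yields a group label per run.
new_group = ids.ne(ids.shift()) | events.ne(events.shift())
groups = new_group.cumsum()
print(groups.tolist())  # [1, 1, 2, 2, 3]
```

Each run of identical consecutive events gets one label, so taking the min timestamp of "In" runs and the max of "Out" runs collapses the duplicates without touching the date at all, which is why midnight is no longer a problem.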