How to calculate the sum of time differences between each entry and exit, per employee per day?
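I'm working with this DataFrame. Each employee has a unique ID, and in the E/X column 6 marks the time the employee entered and 1 the time the employee left.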
Emp E/X DateTime Date Time
107 6 2022-01-04 10:04:18 0 2022-01-04 10:04:18
107 6 2022-01-04 11:32:52 0 2022-01-04 11:32:52
107 6 2022-01-04 11:39:59 0 2022-01-04 11:39:59
107 1 2022-01-04 12:05:26 0 2022-01-04 12:05:26
107 6 2022-01-04 18:02:18 0 2022-01-04 18:02:18
107 6 2022-01-04 18:30:38 0 2022-01-04 18:30:38
107 1 2022-01-04 19:06:58 0 2022-01-04 19:06:58
107 1 2022-01-05 12:22:10 0 2022-01-05 12:22:10
107 6 2022-01-05 19:22:15 0 2022-01-05 19:22:15
122 1 2022-01-03 08:57:40 0 2022-01-03 08:57:40
122 6 2022-01-03 12:49:33 0 2022-01-03 12:49:33
122 1 2022-01-03 13:22:28 0 2022-01-03 13:22:28
122 6 2022-01-03 16:29:51 0 2022-01-03 16:29:51
122 1 2022-01-03 16:40:06 0 2022-01-03 16:40:06
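I'd like to know whether it's possible to calculate how long each employee worked per day, and to fix the E/X column so that every day has a consistent alternating in/out sequence. The data contains errors: sometimes there are several consecutive entries. For example, taking the rows two at a time, I would change the second row of each pair to an exit: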
Emp E/X DateTime Date Time
107 6 2022-01-04 10:04:18 0 2022-01-04 10:04:18
107 1 2022-01-04 11:32:52 0 2022-01-04 11:32:52
122 6 2022-01-03 08:57:40 0 2022-01-03 08:57:40
122 1 2022-01-03 12:49:33 0 2022-01-03 12:49:33
122 6 2022-01-03 13:22:28 0 2022-01-03 13:22:28
122 1 2022-01-03 16:29:51 0 2022-01-03 16:29:51 this line is going to be deleted
122 1 2022-01-03 16:40:06 0 2022-01-03 16:40:06
Desired result:
Emp E/X DateTime Date Time
107 6 2022-01-04 10:04:18 0 2022-01-04 10:04:18
107 1 2022-01-04 11:32:52 0 2022-01-04 11:32:52
107 6 2022-01-04 11:39:59 0 2022-01-04 11:39:59
107 1 2022-01-04 12:05:26 0 2022-01-04 12:05:26
107 6 2022-01-04 18:02:18 0 2022-01-04 18:02:18
107 1 2022-01-04 19:06:58 0 2022-01-04 19:06:58
107 6 2022-01-05 12:22:10 0 2022-01-05 12:22:10
107 1 2022-01-05 19:22:15 0 2022-01-05 19:22:15
122 6 2022-01-03 08:57:40 0 2022-01-03 08:57:40
122 1 2022-01-03 12:49:33 0 2022-01-03 12:49:33
122 6 2022-01-03 13:22:28 0 2022-01-03 13:22:28
122 1 2022-01-03 16:40:06 0 2022-01-03 16:40:06
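Once E/X is fixed, I want to calculate, for each employee and each day, the sum of the differences between every 6 and the 1 that follows it.
Desired result: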
EMP Date WorkHours
4 107 2022-01-03 2
5 107 2022-01-04 8
6 122 2022-01-03 4
I'll use my own test data and assume it is already clean, i.e. alternating start/end datetimes.
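For data that is not yet clean, the rule implied by the question's example could be sketched as follows: within each employee-day, sort by time, drop the second-to-last row when the row count is odd, then relabel the rows 6, 1, 6, 1, and so on. This is only one reading of the example, not a confirmed specification, and the function name `fix_entry_exit` is hypothetical.

```python
import pandas as pd

def fix_entry_exit(df, emp_col="Emp", flag_col="E/X", time_col="DateTime"):
    """Relabel each employee-day so its rows alternate 6 (in), 1 (out)."""
    df = df.sort_values([emp_col, time_col]).copy()
    day = df[time_col].dt.normalize().rename("day")
    grp = df.groupby([emp_col, day])
    pos = grp.cumcount()                       # 0-based position within the employee-day
    size = grp[time_col].transform("size")     # number of rows in the employee-day
    # Assumption taken from the question's example: in an odd-sized group
    # the second-to-last row is the one to discard.
    drop = (size % 2 == 1) & (pos == size - 2)
    df = df[~drop].copy()
    # Recompute positions after dropping, then alternate 6, 1, 6, 1, ...
    day = df[time_col].dt.normalize().rename("day")
    pos = df.groupby([emp_col, day]).cumcount()
    df[flag_col] = pos.mod(2).map({0: 6, 1: 1})
    return df.reset_index(drop=True)

# The raw data from the question
raw = pd.DataFrame(
    {
        "Emp": [107] * 9 + [122] * 5,
        "E/X": [6, 6, 6, 1, 6, 6, 1, 1, 6, 1, 6, 1, 6, 1],
        "DateTime": pd.to_datetime(
            [
                "2022-01-04 10:04:18", "2022-01-04 11:32:52", "2022-01-04 11:39:59",
                "2022-01-04 12:05:26", "2022-01-04 18:02:18", "2022-01-04 18:30:38",
                "2022-01-04 19:06:58", "2022-01-05 12:22:10", "2022-01-05 19:22:15",
                "2022-01-03 08:57:40", "2022-01-03 12:49:33", "2022-01-03 13:22:28",
                "2022-01-03 16:29:51", "2022-01-03 16:40:06",
            ]
        ),
    }
)
clean = fix_entry_exit(raw)
```

A day with a single row is left as a lone entry with no matching exit, so such days would still need separate handling before computing durations.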
Setup
import pandas as pd

# two employees with alternating entry (6) / exit (1) timestamps,
# given as hour offsets from 2022-01-01 00:00
df = pd.concat(
    [
        pd.DataFrame(
            {
                "employee": [107] * 6,
                "E/X": [6, 1, 6, 1, 6, 1],
                "datetime": pd.Timestamp("2022") + pd.Series([0, 4, 14, 26, 40, 50]).apply(pd.Timedelta, unit="hours"),
            }
        ),
        pd.DataFrame(
            {
                "employee": [122] * 8,
                "E/X": [6, 1, 6, 1, 6, 1, 6, 1],
                "datetime": pd.Timestamp("2022") + pd.Series([3, 20, 30, 35, 45, 55, 56, 60]).apply(pd.Timedelta, unit="hours"),
            }
        ),
    ]
).reset_index(drop=True)
df
which looks like this:
employee E/X datetime
0 107 6 2022-01-01 00:00:00
1 107 1 2022-01-01 04:00:00
2 107 6 2022-01-01 14:00:00
3 107 1 2022-01-02 02:00:00
4 107 6 2022-01-02 16:00:00
5 107 1 2022-01-03 02:00:00
6 122 6 2022-01-01 03:00:00
7 122 1 2022-01-01 20:00:00
8 122 6 2022-01-02 06:00:00
9 122 1 2022-01-02 11:00:00
10 122 6 2022-01-02 21:00:00
11 122 1 2022-01-03 07:00:00
12 122 6 2022-01-03 08:00:00
13 122 1 2022-01-03 12:00:00
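Since everything that follows relies on the entries and exits strictly alternating, it may be worth checking that assumption first. A minimal sketch, where the helper name `is_alternating` is hypothetical:

```python
import pandas as pd

def is_alternating(df, emp_col="employee", flag_col="E/X", time_col="datetime"):
    """Per employee: True when the flags read 6, 1, 6, 1, ... and end on a 1."""
    def check(flags):
        values = flags.tolist()
        # An odd number of rows can never match, because the lengths differ
        return values == [6, 1] * (len(values) // 2)

    ordered = df.sort_values([emp_col, time_col])
    return ordered.groupby(emp_col)[flag_col].apply(check)

# Small illustrative sample: employee 1 is clean, employee 2 has two entries in a row
sample = pd.DataFrame(
    {
        "employee": [1, 1, 2, 2, 2],
        "E/X": [6, 1, 6, 6, 1],
        "datetime": pd.to_datetime(
            [
                "2022-01-01 08:00", "2022-01-01 12:00",
                "2022-01-01 09:00", "2022-01-01 10:00", "2022-01-01 17:00",
            ]
        ),
    }
)
alternating = is_alternating(sample)
```

Any employee flagged `False` would need cleaning before the interval construction below can pair starts with ends correctly.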
Solution
This uses a package called piso (pandas interval set operations). In particular, it follows the last example in the piso.coverage documentation.
import piso

# create the required day range and convert it to an interval index
days = pd.date_range("2022", freq="D", periods=4)
day_intervals = pd.IntervalIndex.from_breaks(days)

# creates an interval index from the start and end times for one employee,
# then calculates the total covered time within each bin in day_intervals
def calc_employee(d):
    ii = pd.IntervalIndex.from_arrays(d.loc[d["E/X"] == 6, "datetime"], d.loc[d["E/X"] == 1, "datetime"])
    return piso.coverage(ii, domain=day_intervals, bins=True, how="sum")

# apply the function for each employee
hours_worked = df.groupby("employee").apply(calc_employee)

# columns will be day_intervals, let's change them to the start of each day
hours_worked.columns = hours_worked.columns.left

# melt it into tidy data format
hours_worked = hours_worked.melt(var_name="date", value_name="timedelta", ignore_index=False).reset_index()

# calculate hours from the timedelta value (optional)
hours_worked["hours"] = hours_worked["timedelta"] / pd.Timedelta(hours=1)
hours_worked
which looks like this:
employee date timedelta hours
0 107 2022-01-01 0 days 14:00:00 14.0
1 122 2022-01-01 0 days 17:00:00 17.0
2 107 2022-01-02 0 days 10:00:00 10.0
3 122 2022-01-02 0 days 08:00:00 8.0
4 107 2022-01-03 0 days 02:00:00 2.0
5 122 2022-01-03 0 days 11:00:00 11.0
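If installing piso is not an option, the same per-day totals can be reproduced with plain pandas by cutting each entry/exit interval at midnight and summing the pieces. The sketch below is not the piso implementation, just an illustration of the arithmetic; `split_by_day` and `hours_per_day` are hypothetical helper names, and the data is again assumed to be clean.

```python
import pandas as pd

def split_by_day(start, end):
    """Yield (day, hours) pieces of the interval [start, end), cut at each midnight."""
    while start < end:
        next_midnight = start.normalize() + pd.Timedelta(days=1)
        piece_end = min(end, next_midnight)
        yield start.normalize(), (piece_end - start).total_seconds() / 3600
        start = piece_end

def hours_per_day(df):
    """Sum worked hours per employee and calendar day (assumes alternating 6/1 rows)."""
    rows = []
    for emp, d in df.sort_values("datetime").groupby("employee"):
        starts = d.loc[d["E/X"] == 6, "datetime"]
        ends = d.loc[d["E/X"] == 1, "datetime"]
        for s, e in zip(starts, ends):
            for day, hours in split_by_day(s, e):
                rows.append((emp, day, hours))
    out = pd.DataFrame(rows, columns=["employee", "date", "hours"])
    return out.groupby(["employee", "date"], as_index=False)["hours"].sum()

# Same test data as in the setup, written as hour offsets from 2022-01-01
base = pd.Timestamp("2022-01-01")
offsets = [0, 4, 14, 26, 40, 50, 3, 20, 30, 35, 45, 55, 56, 60]
df = pd.DataFrame(
    {
        "employee": [107] * 6 + [122] * 8,
        "E/X": [6, 1] * 7,
        "datetime": [base + pd.Timedelta(hours=h) for h in offsets],
    }
)
result = hours_per_day(df)
```

For this data the `hours` column comes out as 14, 10 and 2 for employee 107 and 17, 8 and 11 for employee 122, matching the piso output above (sorted by employee rather than by date). Unlike the piso version, days on which an employee has no recorded time are simply absent instead of showing zero.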