Find average incidents per business day
I have a dataset as follows:
+----+-------+---------------------+
| ID | SUBID | date |
+----+-------+---------------------+
| A | 1 | 2021-01-01 12:00:00 |
| A | 1 | 2021-01-02 01:00:00 |
| A | 1 | 2021-01-02 02:00:00 |
| A | 1 | 2021-01-03 03:00:00 |
| A | 2 | 2021-01-05 16:00:00 |
| A | 2 | 2021-01-06 13:00:00 |
| A | 2 | 2021-01-07 06:00:00 |
| A | 2 | 2021-01-08 08:00:00 |
| A | 2 | 2021-01-08 10:00:00 |
| A | 2 | 2021-01-08 11:00:00 |
| A | 3 | 2021-01-09 09:00:00 |
| A | 3 | 2021-01-10 19:00:00 |
| A | 3 | 2021-01-11 20:00:00 |
| A | 3 | 2021-01-12 22:00:00 |
| B | 1 | 2021-02-01 23:00:00 |
| B | 1 | 2021-02-02 15:00:00 |
| B | 1 | 2021-02-03 06:00:00 |
| B | 1 | 2021-02-04 08:00:00 |
| B | 2 | 2021-02-05 18:00:00 |
| B | 2 | 2021-02-05 19:00:00 |
| B | 2 | 2021-02-06 22:00:00 |
| B | 2 | 2021-02-06 23:00:00 |
| B | 2 | 2021-02-07 04:00:00 |
| B | 2 | 2021-02-08 02:00:00 |
| B | 3 | 2021-02-09 01:00:00 |
| B | 3 | 2021-02-10 03:00:00 |
| B | 3 | 2021-02-11 13:00:00 |
| B | 3 | 2021-02-12 14:00:00 |
+----+-------+---------------------+
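I would like to get the time difference, in hours, between consecutive rows within each (ID, SUBID) group, preferably in business hours, where any date that falls on a weekend or a federal holiday is moved to the nearest working day (before or after) with the time set to 23:59:59, as shown below: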
+----+-------+---------------------+------------------------------------------------------------------+
| ID | SUBID | date | timediff (hours) with preceding date for each group (ID, SUBID) |
+----+-------+---------------------+------------------------------------------------------------------+
| A | 1 | 2021-01-01 12:00:00 | 0 |
| A | 1 | 2021-01-02 01:00:00 | 13 |
| A | 1 | 2021-01-02 02:00:00 | 1 |
| A | 1 | 2021-01-03 03:00:00 | 1 |
| A | 2 | 2021-01-05 16:00:00 | 0 |
| A | 2 | 2021-01-06 13:00:00 | 21 |
| A | 2 | 2021-01-07 06:00:00 | 17 |
| A | 2 | 2021-01-08 08:00:00 | 2 |
| A | 2 | 2021-01-08 10:00:00 | 2 |
| A | 2 | 2021-01-08 11:00:00 | 1 |
| A | 3 | 2021-01-09 09:00:00 | 0 |
| A | 3 | 2021-01-10 19:00:00 | 36 |
| A | 3 | 2021-01-11 20:00:00 | 1 |
| A | 3 | 2021-01-12 22:00:00 | 1 |
| B | 1 | 2021-02-01 23:00:00 | 0 |
| B | 1 | 2021-02-02 15:00:00 | 16 |
| B | 1 | 2021-02-03 06:00:00 | 15 |
| B | 1 | 2021-02-04 08:00:00 | 26 |
| B | 2 | 2021-02-05 18:00:00 | 0 |
| B | 2 | 2021-02-05 19:00:00 | 1 |
| B | 2 | 2021-02-06 22:00:00 | 27 |
| B | 2 | 2021-02-06 23:00:00 | 1 |
| B | 2 | 2021-02-07 04:00:00 | 5 |
| B | 2 | 2021-02-08 02:00:00 | 22 |
| B | 3 | 2021-02-09 01:00:00 | 0 |
| B | 3 | 2021-02-10 03:00:00 | 26 |
| B | 3 | 2021-02-11 13:00:00 | 11 |
| B | 3 | 2021-02-12 14:00:00 | 1 |
+----+-------+---------------------+------------------------------------------------------------------+
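Finally, I want to compute the average time, i.e. the sum of the time differences in each (ID, SUBID) group divided by the number of rows in that group, as shown below: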
+----+-------+------------------------------------------------------------+
| ID | SUBID | Average time (total time diff of group / count per group) |
+----+-------+------------------------------------------------------------+
| A | 1 | 15/4 |
| A | 2 | 43/6 |
| A | 3 | 38/4 |
| B | 1 | 57/4 |
| B | 2 | 56/6 |
| B | 3 | 38/4 |
+----+-------+------------------------------------------------------------+
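I am new to R. I came across lubridate to help me format the dates, and I was able to get the time differences with the code below:
df %>%
  group_by(ID, SUBID) %>%
  mutate(time_diff = difftime(date, lag(date), units = 'min'))
But I am having trouble getting the business-day time differences and computing the average time as in the last table.
Welcome to SO! Using dplyr and lubridate:
Data used: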
library(tidyverse)
library(lubridate)
df <- data.frame(ID = c("A","A","A","A"),
                 SUBID = c(1,1,2,2),
                 Date = lubridate::as_datetime(c("2021-01-01 12:00:00","2021-01-02 1:00:00",
                                                 "2021-01-01 2:00:00","2021-01-01 13:00:00")))
ID SUBID Date
1 A 1 2021-01-01 12:00:00
2 A 1 2021-01-02 01:00:00
3 A 2 2021-01-01 02:00:00
4 A 2 2021-01-01 13:00:00
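Code:
df %>%
  arrange(ID, SUBID, Date) %>%
  group_by(ID, SUBID) %>%
  # Hours since the previous row of the group; explicit units stop difftime
  # from choosing minutes or days on its own
  mutate(diff = as.numeric(difftime(Date, lag(Date), units = "hours"))) %>%
  # The first row of each group has no predecessor, so its difference is 0
  mutate(diff = ifelse(is.na(diff), 0, diff)) %>%
  summarise(Average = sum(diff) / n(), .groups = "drop")
Output: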
ID SUBID Average
<chr> <dbl> <dbl>
1 A 1 6.5
2 A 2 5.5
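To see where 6.5 and 5.5 come from, it can help to keep the total and the row count next to the average. This is only a sketch on the same four-row example; the column names diff_hours, total_hours and n_events are made up for illustration.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  ID    = c("A", "A", "A", "A"),
  SUBID = c(1, 1, 2, 2),
  Date  = as_datetime(c("2021-01-01 12:00:00", "2021-01-02 01:00:00",
                        "2021-01-01 02:00:00", "2021-01-01 13:00:00"))
)

res <- df %>%
  arrange(ID, SUBID, Date) %>%
  group_by(ID, SUBID) %>%
  # Hours since the previous row in the group; the first row gets 0
  mutate(diff_hours = coalesce(
    as.numeric(difftime(Date, lag(Date), units = "hours")), 0)) %>%
  summarise(total_hours = sum(diff_hours),   # numerator of the average
            n_events    = n(),               # denominator of the average
            Average     = total_hours / n_events,
            .groups = "drop")

print(res)
```

Group (A, 1) has a single 13-hour gap over 2 rows and group (A, 2) a single 11-hour gap over 2 rows, which gives the 6.5 and 5.5 shown above.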
Edit: how to handle weekends
For weekends, the simplest solution is to move the day to the following Monday:
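df %>%
  # Full day name in the session locale (French here)
  mutate(week_day = wday(Date, label = TRUE, abbr = FALSE)) %>%
  mutate(Date = ifelse(week_day == "samedi", Date + days(2),
                       ifelse(week_day == "dimanche", Date + days(1), Date))) %>%
  # ifelse() drops the date-time class, so convert the numbers back
  mutate(Date = as_datetime(Date))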
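This creates a column week_day holding the name of the day. If the day is "samedi" (Saturday) or "dimanche" (Sunday), 2 days or 1 day respectively are added to the date so that it becomes Monday. Then you only need to re-sort the dates (df %>% arrange(ID, SUBID, Date)) and re-run the first code.
Since my locale is French, you will have to change samedi and dimanche to Saturday and Sunday (capitalized, as wday() returns them in an English locale).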
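Because the day names depend on the session locale, a variant that compares day numbers instead may be easier to share. The following is a sketch, not the code from the answer above: it assumes lubridate's wday() with week_start = 1, so that Monday is 1, Saturday is 6 and Sunday is 7, and it uses dplyr's case_when() so the Date column keeps its date-time class.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  ID = "A", SUBID = 1,
  Date = as_datetime(c("2021-01-01 12:00:00",   # Friday
                       "2021-01-02 01:00:00",   # Saturday
                       "2021-01-03 03:00:00"))  # Sunday
)

shifted <- df %>%
  mutate(week_day = wday(Date, week_start = 1),          # Monday = 1, ..., Sunday = 7
         Date = case_when(week_day == 6 ~ Date + days(2),  # Saturday -> Monday
                          week_day == 7 ~ Date + days(1),  # Sunday   -> Monday
                          TRUE ~ Date))                     # weekdays unchanged

print(shifted)
```

Note that week_day still holds the day number of the original date; it is only used to decide the shift.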
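For holidays, you can apply the same approach: create interval variables that represent the holidays, test whether each date falls within one of these intervals, and if it does, move the date to the last day of that interval.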
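One possible way to combine the weekend and holiday rules is sketched below. It is a variation on the idea above rather than the exact method described: it uses a plain vector of holiday dates instead of intervals and always moves forward to the next working day. The helper name next_business_day and the holiday list are assumptions for illustration; supply the calendar you actually need.

```r
library(lubridate)

# Assumed holiday list for illustration only; replace with your own calendar
holidays <- as.Date(c("2021-01-01", "2021-01-18"))

# Hypothetical helper: push each timestamp forward one day at a time
# until it no longer falls on a weekend or a listed holiday
next_business_day <- function(x, holidays) {
  is_off <- function(d) {
    wday(d, week_start = 1) >= 6 | as_date(d) %in% holidays
  }
  idx <- which(is_off(x))          # which() ignores missing values
  while (length(idx) > 0) {
    x[idx] <- x[idx] + days(1)
    idx <- which(is_off(x))
  }
  x
}

dates <- as_datetime(c("2021-01-01 12:00:00",   # Friday and a listed holiday
                       "2021-01-05 16:00:00"))  # Tuesday
shifted <- next_business_day(dates, holidays)
print(shifted)
```

Here the first timestamp passes over the holiday and the weekend and lands on Monday 2021-01-04 at 12:00, while the second is left as it is. After shifting, the dates can be re-sorted and the first code re-run as described above.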