在事件发生后的时间间隔内分配唯一 ID

Question

这是一个有点奇怪的案例，我一直无法在 Whosebug 上找到解决方案。我有一个数据集，其中包含一个日期时间列和一个指示事件的值列，例如下面的 dat 示例。日期时间是每小时，但请注意偶尔会出现“错过”的时间（第 12 行和第 13 行之间缺少 2 小时）。

dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")), 
                                 max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
                  event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6))

> dat
              datetime event
1  2010-04-03 03:00:00     1
2  2010-04-03 04:00:00    NA
3  2010-04-03 05:00:00    NA
4  2010-04-03 06:00:00    NA
5  2010-04-03 07:00:00    NA
6  2010-04-03 08:00:00    NA
7  2010-04-03 09:00:00    NA
8  2010-04-03 10:00:00    NA
9  2010-04-03 11:00:00    NA
10 2010-04-03 12:00:00    NA
11 2010-04-03 13:00:00     2
12 2010-04-03 14:00:00    NA
13 2010-04-03 17:00:00    NA
14 2010-04-03 18:00:00    NA
15 2010-04-03 19:00:00    NA
16 2010-04-03 20:00:00    NA
17 2010-04-03 21:00:00     3
18 2010-04-03 22:00:00     4
19 2010-04-03 23:00:00    NA
20 2010-04-04 00:00:00    NA
21 2010-04-04 01:00:00    NA
22 2010-04-04 02:00:00    NA
23 2010-04-04 03:00:00    NA
24 2010-04-04 04:00:00    NA
25 2010-04-04 05:00:00    NA
26 2010-04-04 06:00:00    NA
27 2010-04-04 07:00:00    NA
28 2010-04-04 08:00:00     5
29 2010-04-04 09:00:00    NA
30 2010-04-04 10:00:00     6

我希望事件发生后 7 小时间隔内的每一行都用唯一标识符标识，但有以下注意事项（因此是“奇怪的案例”）：

如果后续事件发生在前一个事件的 7 小时内，则该后续事件基本上会被忽略（即，“事件”编号不等于分配的标识符值），并且
考虑到缺失的时间（即规则基于经过的时间，而不是行数）。

产品看起来像 result:

library(dplyr)

result <- dat %>% 
  mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))

> result
              datetime event id
1  2010-04-03 03:00:00     1  1
2  2010-04-03 04:00:00    NA  1
3  2010-04-03 05:00:00    NA  1
4  2010-04-03 06:00:00    NA  1
5  2010-04-03 07:00:00    NA  1
6  2010-04-03 08:00:00    NA  1
7  2010-04-03 09:00:00    NA  1
8  2010-04-03 10:00:00    NA  1
9  2010-04-03 11:00:00    NA NA
10 2010-04-03 12:00:00    NA NA
11 2010-04-03 13:00:00     2  2
12 2010-04-03 14:00:00    NA  2
13 2010-04-03 17:00:00    NA  2
14 2010-04-03 18:00:00    NA  2
15 2010-04-03 19:00:00    NA  2
16 2010-04-03 20:00:00    NA  2
17 2010-04-03 21:00:00     3  3
18 2010-04-03 22:00:00     4  3
19 2010-04-03 23:00:00    NA  3
20 2010-04-04 00:00:00    NA  3
21 2010-04-04 01:00:00    NA  3
22 2010-04-04 02:00:00    NA  3
23 2010-04-04 03:00:00    NA  3
24 2010-04-04 04:00:00    NA  3
25 2010-04-04 05:00:00    NA NA
26 2010-04-04 06:00:00    NA NA
27 2010-04-04 07:00:00    NA NA
28 2010-04-04 08:00:00     5  4
29 2010-04-04 09:00:00    NA  4
30 2010-04-04 10:00:00     6  4

最理想的情况是，这将在 dplyr 框架中完成。

Answer 1

library(lubridate)
library(tidyverse)

dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")), 
                                 max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
                  event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)) %>% 
  mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))


Events <- dat %>% 
  #Get only the roes with events
  filter(!is.na(event)) %>% 
  #Get the duration of time between events
  mutate(
    EventLag = datetime - lag(datetime)) %>% 
  ## remove events that occurred < 7 hrs after the previous or that are NA (i.e. the first one). but in the real data
  ## I do not suspect your first point would ever be an event...? Maybe this can be removed in the 
  ## real dataset...
  filter(as.numeric(EventLag) > 7| is.na(EventLag)) %>% 
  as.data.frame()

## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
## Give the events a new ID so there are no gaps
## Join them with the rest of the datetime stamps
Events <- Events %>% 
  mutate(ID = row_number()) %>% 
  dplyr::select(datetime, ID)


## Expand each event by 7 hrs
Events <- Events %>%
  group_by(ID) %>%
  do(data.frame(ID= .$ID, datetime= seq(.$datetime, .$datetime + hours(7), by = '1 hour'), stringsAsFactors=FALSE)) %>% 
  as.data.frame()


## Join with initial data by datettime
DatJoin <- dat %>% 
  left_join(Events, by = "datetime")


DatJoin

在事件发生后的时间间隔内分配唯一 ID

Assign unique ID within a time interval following an event

datetime

r

time-series

uniqueidentifier