How to group observations together based on (chronological) time differences in R
I am trying to group observations based on how close together they occur in time.
data <- data.frame(date = c("2020-04-14 03:26:58", "2020-04-14 11:26:58", "2020-04-14 12:29:20", "2020-04-14 12:48:02",
"2020-04-15 13:01:09", "2020-04-15 13:16:21", "2020-04-15 13:51:06", "2020-04-16 13:59:11",
"2020-04-16 14:01:37", "2020-04-18 20:02:37", "2020-04-18 20:17:37"))
data$date <- as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S")
head(data, 11)
date
1 2020-04-14 03:26:58
2 2020-04-14 11:26:58
3 2020-04-14 12:29:20
4 2020-04-14 12:48:02
5 2020-04-15 13:01:09
6 2020-04-15 13:16:21
7 2020-04-15 13:51:06
8 2020-04-16 13:59:11
9 2020-04-16 14:01:37
10 2020-04-18 20:02:37
11 2020-04-18 20:17:37
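I want to assign the observations to discrete groups depending on whether they occurred within the same time period.
For example, the rule could be: if the time difference between a row and its lagged row is less than 2 hours, the rows are grouped together.
I tried creating a lagged variable and calculating the time difference between each row and its lagged row, but I can't figure out how to derive the groups from that.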
# Create lagged date variable (dplyr::lag shifts the vector; stats::lag does not)
data$lag <- dplyr::lag(data$date)
# Calculate time difference in hours between original and lagged variable
data$time_diff <- as.numeric(difftime(data$date, data$lag, units = "hours"))
In this case, the desired output would add a group column to data, for example:
date lag time_diff group
1 2020-04-14 03:26:58 <NA> NA A
2 2020-04-14 11:26:58 2020-04-14 03:26:58 8.00000000 B
3 2020-04-14 12:29:20 2020-04-14 11:26:58 1.03944444 B
4 2020-04-14 12:48:02 2020-04-14 12:29:20 0.31166667 B
5 2020-04-15 13:01:09 2020-04-14 12:48:02 24.21861111 C
6 2020-04-15 13:16:21 2020-04-15 13:01:09 0.25333333 C
7 2020-04-15 13:51:06 2020-04-15 13:16:21 0.57916667 C
8 2020-04-16 13:59:11 2020-04-15 13:51:06 24.13472222 D
9 2020-04-16 14:01:37 2020-04-16 13:59:11 0.04055556 D
10 2020-04-18 20:02:37 2020-04-16 14:01:37 54.01666667 E
11 2020-04-18 20:17:37 2020-04-18 20:02:37 0.25000000 E
A data.table option:
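library(data.table)
setDT(data)
# Hours since the previous row; the first row is compared with itself, giving 0
data[, diff_hours := as.numeric(difftime(date, shift(date, fill = date[1]), units = "hours"))]
# Start a new group whenever the gap is 2 hours or more (LETTERS covers at most 26 groups)
data[, group := LETTERS[cumsum(diff_hours >= 2) + 1L]]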
# date diff_hours group
# 1: 2020-04-14 03:26:58 0.00000000 A
# 2: 2020-04-14 11:26:58 8.00000000 B
# 3: 2020-04-14 12:29:20 1.03944444 B
# 4: 2020-04-14 12:48:02 0.31166667 B
# 5: 2020-04-15 13:01:09 24.21861111 C
# 6: 2020-04-15 13:16:21 0.25333333 C
# 7: 2020-04-15 13:51:06 0.57916667 C
# 8: 2020-04-16 13:59:11 24.13472222 D
# 9: 2020-04-16 14:01:37 0.04055556 D
# 10: 2020-04-18 20:02:37 54.01666667 E
# 11: 2020-04-18 20:17:37 0.25000000 E
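For comparison, the same idea can be sketched in base R without any packages. This is only an illustration, not part of the original answer: it assumes the timestamps are already sorted in ascending order, it fixes tz = "UTC" so the result does not depend on the local time zone, and the names dates, gap_hours, group_id and result are made up for this example. It uses integer group IDs instead of LETTERS, which may be preferable if there could be more than 26 groups.

```r
# Same sample timestamps as in the question.
# tz = "UTC" is an assumption made for this sketch only.
dates <- as.POSIXct(
  c("2020-04-14 03:26:58", "2020-04-14 11:26:58", "2020-04-14 12:29:20",
    "2020-04-14 12:48:02", "2020-04-15 13:01:09", "2020-04-15 13:16:21",
    "2020-04-15 13:51:06", "2020-04-16 13:59:11", "2020-04-16 14:01:37",
    "2020-04-18 20:02:37", "2020-04-18 20:17:37"),
  format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
)

# Gap to the previous observation in hours.
# The first row has no predecessor, so its gap is set to 0.
gap_hours <- c(0, as.numeric(diff(dates), units = "hours"))

# A new group starts whenever the gap is 2 hours or more;
# the cumulative sum of those breaks gives a running group number.
group_id <- cumsum(gap_hours >= 2) + 1L

result <- data.frame(date = dates, gap_hours = gap_hours, group_id = group_id)
print(result)
```

If letter labels are wanted and there are at most 26 groups, LETTERS[group_id] turns the integer IDs into A, B, C and so on, as in the data.table version.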