Issue with dplyr and cut for time intervals
I am trying to split my data into 5-second time intervals and group them using dplyr.
Below is my raw data. The date and time are in separate columns, which I later combine using as.POSIXct.
structure(list(Date = c("10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013"), Time = c("20:06:57", "20:07:13",
"20:07:25", "20:07:30", "20:08:16", "20:08:17", "20:08:26", "20:09:05",
"20:09:06", "20:09:07", "20:09:37", "20:09:38", "20:09:55", "20:12:34",
"20:14:15"), ID = c("M1", "M1", "M1", "M3", "M1", "M1", "M8",
"M9", "M9", "M9", "M1", "M1", "M1", "M5", "M1")), .Names = c("Date",
"Time", "ID"), class = "data.frame", row.names = c(NA, -15L))
My code is attached below:
data$datetime <- as.POSIXct(paste(data$Date, data$Time), format="%m/%d/%Y %H:%M:%S")
data_order <- data %>% arrange(datetime,ID)
data_order$group <- data_order %>% group_by(by5sec=cut(datetime, breaks= "5 secs",right =T),ID) %>% group_indices()
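While some observations are grouped correctly, others are wrong. I tried two versions, one with "right=T" and one without it, and got different groups, but both versions contain errors. I also tried as.numeric, as.POSIXct, etc., all in vain.
The output of both versions is attached. The rows in red are wrongly coded as 2 different groups.
Version 1: "right = T" in cut
Version 2: "right = F" in cut
Can someone help with this issue? I have spent a long time chasing it, given my limited knowledge of R. I just want a break after 5 seconds for the same ID (the group should also change when the ID changes).
Expected output
I'm not entirely clear about the output images you show. Based on your problem description, how about something like this?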
library(tidyverse)

# df is the data frame from the question (called data there)
df %>%
  unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
  mutate(
    datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
    datetime.by5sec = as.numeric(cut(datetime, "sec")) %/% 5 + 1)
#              datetime       Date     Time ID datetime.by5sec
#1  2013-10-30 20:06:57 10/30/2013 20:06:57 M1               1
#2  2013-10-30 20:07:13 10/30/2013 20:07:13 M1               4
#3  2013-10-30 20:07:25 10/30/2013 20:07:25 M1               6
#4  2013-10-30 20:07:30 10/30/2013 20:07:30 M3               7
#5  2013-10-30 20:08:16 10/30/2013 20:08:16 M1              17
#6  2013-10-30 20:08:17 10/30/2013 20:08:17 M1              17
#7  2013-10-30 20:08:26 10/30/2013 20:08:26 M8              19
#8  2013-10-30 20:09:05 10/30/2013 20:09:05 M9              26
#9  2013-10-30 20:09:06 10/30/2013 20:09:06 M9              27
#10 2013-10-30 20:09:07 10/30/2013 20:09:07 M9              27
#11 2013-10-30 20:09:37 10/30/2013 20:09:37 M1              33
#12 2013-10-30 20:09:38 10/30/2013 20:09:38 M1              33
#13 2013-10-30 20:09:55 10/30/2013 20:09:55 M1              36
#14 2013-10-30 20:12:34 10/30/2013 20:12:34 M5              68
#15 2013-10-30 20:14:15 10/30/2013 20:14:15 M1              88
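Explanation: datetime.by5sec gives the index of the 5-second bin that datetime falls into. The first entry is in bin 1; the second entry is in the 4th 5-second bin, i.e. within 20 seconds of the first entry, and so on. Here I bin by the second with cut(datetime, "sec") and then use integer division %/% 5 to merge the one-second bins into 5-second bins.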
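For comparison, here is a minimal base R sketch of the same kind of fixed-width binning, computed from elapsed seconds instead of via cut. The timestamps are a hypothetical sample (the first and last come from the question, the two in the middle are made up), and elapsed and bin are names chosen only for illustration. In this variant every bin spans exactly 5 seconds counted from the first timestamp, so an index may differ by one from the cut-based column near a bin edge.

```r
# Hypothetical sample: first and last timestamps are from the question,
# the two in the middle are invented to show what happens at a bin edge.
x <- as.POSIXct(
  c("2013-10-30 20:06:57", "2013-10-30 20:07:01",
    "2013-10-30 20:07:02", "2013-10-30 20:07:13"),
  tz = "UTC"
)

# Seconds elapsed since the earliest timestamp; units are fixed explicitly
elapsed <- as.numeric(difftime(x, min(x), units = "secs"))

# Index of the 5-second bin, counting the first bin as 1
bin <- elapsed %/% 5 + 1
print(data.frame(x, elapsed, bin))
```

The second and third rows are only one second apart but land in different bins. Fixed bins of this kind therefore cannot express "start a new group after a gap of 5 seconds", which is what the update below addresses.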
Update
The following reproduces your expected output:
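df %>%
  unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
  group_by(ID) %>%
  mutate(
    datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
    # gap to the previous row of the same ID, always in seconds
    # (NA for the first row of each ID)
    difftime = as.numeric(difftime(datetime, lag(datetime), units = "secs"))) %>%
  ungroup() %>%
  mutate(
    # new group at the first row of an ID or after a gap of >= 5 seconds
    group = cumsum(is.na(difftime) | abs(difftime) >= 5)) %>%
  select(Date, Time, ID, datetime, group)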
# A tibble: 15 x 5
#   Date       Time     ID    datetime            group
#   <chr>      <chr>    <chr> <dttm>              <int>
# 1 10/30/2013 20:06:57 M1    2013-10-30 20:06:57     1
# 2 10/30/2013 20:07:13 M1    2013-10-30 20:07:13     2
# 3 10/30/2013 20:07:25 M1    2013-10-30 20:07:25     3
# 4 10/30/2013 20:07:30 M3    2013-10-30 20:07:30     4
# 5 10/30/2013 20:08:16 M1    2013-10-30 20:08:16     5
# 6 10/30/2013 20:08:17 M1    2013-10-30 20:08:17     5
# 7 10/30/2013 20:08:26 M8    2013-10-30 20:08:26     6
# 8 10/30/2013 20:09:05 M9    2013-10-30 20:09:05     7
# 9 10/30/2013 20:09:06 M9    2013-10-30 20:09:06     7
#10 10/30/2013 20:09:07 M9    2013-10-30 20:09:07     7
#11 10/30/2013 20:09:37 M1    2013-10-30 20:09:37     8
#12 10/30/2013 20:09:38 M1    2013-10-30 20:09:38     8
#13 10/30/2013 20:09:55 M1    2013-10-30 20:09:55     9
#14 10/30/2013 20:12:34 M5    2013-10-30 20:12:34    10
#15 10/30/2013 20:14:15 M1    2013-10-30 20:14:15    11
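Explanation: We calculate the time difference between consecutive datetime entries, grouped by ID; group is then the cumulative count of time differences of 5 seconds or more. The first row of each ID also counts as such a gap, so it starts a new group.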
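The same gap-based idea can be sketched in base R without dplyr. This is only an illustration under stated assumptions: d is a hypothetical mini data set built from six rows of the question's data, and secs and gap are helper names introduced here. The rows are assumed to be sorted by time already.

```r
# Hypothetical mini data set: six rows taken from the question's data,
# already sorted by time.
d <- data.frame(
  ID = c("M1", "M3", "M1", "M1", "M9", "M9"),
  datetime = as.POSIXct(
    c("2013-10-30 20:07:25", "2013-10-30 20:07:30", "2013-10-30 20:08:16",
      "2013-10-30 20:08:17", "2013-10-30 20:09:05", "2013-10-30 20:09:06"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Seconds since the previous row of the same ID (NA for the first row of an ID)
secs <- as.numeric(d$datetime)
gap <- ave(secs, d$ID, FUN = function(s) c(NA, diff(s)))

# A new group starts at the first row of an ID or after a gap of 5 s or more
d$group <- cumsum(is.na(gap) | gap >= 5)
print(d)
```

Working on as.numeric(datetime) keeps every gap in seconds, so the comparison with 5 does not depend on the units that difftime would otherwise pick automatically.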