Efficiently combine timestamped sensor data into events in R
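The underlying dataset is generated by sensors. Every 6 seconds, each sensor sends a signal identifying every person (carrying a key fob) within range. Ignoring the people, typical data looks like this: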
SensorID timestamp
2 2015-08-04 09:56:32
2 2015-08-04 09:56:38
2 2015-08-05 18:45:20
3 2015-08-04 09:54:33
3 2015-08-04 09:54:39
3 2015-08-04 09:57:31
3 2015-08-04 09:58:09
3 2015-08-04 09:58:15
3 2015-08-04 09:58:33
3 2015-08-04 09:58:39
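I want to convert this into events with start and end times, where consecutive signals from the same sensor (and key fob) are treated as part of the same event if they are less than 60 seconds apart.
So the test data above would be transformed into: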
SensorID startTime endTime sensorCount duration
2 2015-08-04 09:56:32 2015-08-04 09:56:38 2 6 secs
2 2015-08-05 18:45:20 2015-08-05 18:45:20 1 0 secs
3 2015-08-04 09:54:33 2015-08-04 09:54:39 2 6 secs
3 2015-08-04 09:57:31 2015-08-04 09:58:39 5 68 secs
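To make the 60-second rule concrete, here is a small sketch (not part of the original code) that computes the gaps between consecutive signals for sensor 3 in the sample. The time zone is assumed to be UTC, and the names ts and gaps are made up for illustration.

```r
# Timestamps for sensor 3, copied from the sample data (UTC is an assumption).
ts <- as.POSIXct(c(
  "2015-08-04 09:54:33", "2015-08-04 09:54:39", "2015-08-04 09:57:31",
  "2015-08-04 09:58:09", "2015-08-04 09:58:15", "2015-08-04 09:58:33",
  "2015-08-04 09:58:39"), tz = "UTC")

# Gap in seconds between each signal and the one before it.
gaps <- as.numeric(diff(ts), units = "secs")
gaps
# 6 172 38 6 18 6

# Only the second gap exceeds 60 seconds, so a new event starts at the third signal.
which(gaps > 60)
# 2
```

The 172-second gap splits sensor 3 into two events, while the 38-second gap stays inside the second event, which therefore runs from 09:57:31 to 09:58:39 (68 seconds, 5 signals).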
I have working code:
# assumes df is sorted by SensorID, then timestamp, and timestamp is POSIXct
n <- nrow(df)
# identify the ends of sequences: the sensor changes or the next signal is > 60 s away
lastKeep <- c(df$SensorID[-n] != df$SensorID[-1L] |
              difftime(df$timestamp[-1L], df$timestamp[-n], units = "secs") > 60,
              TRUE)  # the final record always ends a sequence
# set startTime and cumulative time and number of signals
df$startTime <- df$timestamp
df$endTime <- df$timestamp
df$sensorCount <- 1
for (jj in 2:nrow(df)) {
  if (!lastKeep[jj - 1]) {
    df$startTime[jj] <- df$startTime[jj - 1]
    df$sensorCount[jj] <- df$sensorCount[jj - 1] + 1
  }
}
# select combined records and create duration
df <- df[lastKeep, ]
df$duration <- difftime(df$endTime, df$startTime, units = "secs")
df$timestamp <- NULL
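However, on my actual test data of 2,000 records this code takes several seconds, and the full dataset already has 6.5 million records and is still being collected. So I need something efficient.
Is there a way to vectorize this, even though it relies on the 'previous' record to supply the cumulative time and signal count?
My current plan is to use Rcpp, but my C++ skills are mediocre at best. Alternatively, is there an R package that collapses consecutive signal records? I couldn't find one in the time-series or signal-processing areas, but those aren't my field, so I may be missing something obvious.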
Here's a possible data.table solution, using the devel version on GH, which should be efficient enough:
library(data.table) # v1.9.5+
setDT(df)[, timestamp := as.POSIXct(timestamp)] # Make sure it's a valid POSIXct class
df[, .(
       startTime = timestamp[1L],
       endTime = timestamp[.N],
       sensorCount = .N,
       duration = difftime(timestamp[.N], timestamp[1L], units = "secs")
      ),
   by = .(SensorID,
          # units must be named: the third positional argument of difftime() is tz
          cumsum(difftime(timestamp, shift(timestamp, fill = timestamp[1L]), units = "secs") > 60))]
# SensorID cumsum startTime endTime sensorCount duration
# 1: 2 0 2015-08-04 09:56:32 2015-08-04 09:56:38 2 6 secs
# 2: 2 1 2015-08-05 18:45:20 2015-08-05 18:45:20 1 0 secs
# 3: 3 1 2015-08-04 09:54:33 2015-08-04 09:54:39 2 6 secs
# 4: 3 2 2015-08-04 09:57:31 2015-08-04 09:58:39 5 68 secs
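The idea here is to group by sensor together with the cumulative sum of time differences greater than 60 seconds, then take the first and last timestamp, the group count, and the difference between the first and last timestamp of each group. The cumulative sum runs over the whole table rather than restarting for each sensor, which is why the cumsum column continues at 1 for sensor 3; pairing it with SensorID still separates the events correctly, provided the data is sorted by sensor and then by timestamp.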
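For comparison, the same gap-flagging idea can be sketched in base R without any packages. This is only an illustration under assumptions, not the answer's method: the data is assumed to be sorted by SensorID and then timestamp, the time zone is assumed to be UTC, and the names secs, newEvent, first, last and events are made up here. As in the code above, a gap of exactly 60 seconds is kept in the same event.

```r
# Sample data from the question (assumed sorted by SensorID, then timestamp).
df <- data.frame(
  SensorID  = c(2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  timestamp = as.POSIXct(c(
    "2015-08-04 09:56:32", "2015-08-04 09:56:38", "2015-08-05 18:45:20",
    "2015-08-04 09:54:33", "2015-08-04 09:54:39", "2015-08-04 09:57:31",
    "2015-08-04 09:58:09", "2015-08-04 09:58:15", "2015-08-04 09:58:33",
    "2015-08-04 09:58:39"), tz = "UTC")
)

n    <- nrow(df)
secs <- as.numeric(df$timestamp)   # seconds since the epoch

# A row starts a new event if it is the first row, the sensor changes,
# or the gap to the previous signal exceeds 60 seconds.
newEvent <- c(TRUE,
              df$SensorID[-1L] != df$SensorID[-n] | diff(secs) > 60)

first <- which(newEvent)           # index of the first row of each event
last  <- c(first[-1L] - 1L, n)     # index of the last row of each event

events <- data.frame(
  SensorID    = df$SensorID[first],
  startTime   = df$timestamp[first],
  endTime     = df$timestamp[last],
  sensorCount = last - first + 1L,
  duration    = difftime(df$timestamp[last], df$timestamp[first], units = "secs")
)
events
```

On the sample this gives the four events listed in the question. Everything is a vectorized comparison or an index lookup, so there is no row-by-row loop; whether it is fast enough on 6.5 million records would need to be measured.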
...and a dplyr (+ lubridate) approach, assuming dt is the dataset provided above:
library(dplyr)
library(lubridate)
dt %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%
  group_by(SensorID) %>%                                        # for each sensor
  mutate(dist = as.numeric(difftime(timestamp,                  # gap between consecutive signals
                                    lag(timestamp, default = min(timestamp)),
                                    units = "secs"))) %>%
  mutate(flag = ifelse(dist > 60, 1, 0),                        # flag gaps > 60 seconds
         sessionID = cumsum(flag) + 1) %>%                      # create session id
  group_by(SensorID, sessionID) %>%                             # for each sensor and session
  summarise(startTime = min(timestamp),                         # get start, end and counts
            endTime = max(timestamp),
            sensorCount = n()) %>%
  mutate(duration = difftime(endTime, startTime, units = "secs")) %>%  # get duration
  ungroup()
# SensorID sessionID startTime endTime sensorCount duration
# 1 2 1 2015-08-04 09:56:32 2015-08-04 09:56:38 2 6 secs
# 2 2 2 2015-08-05 18:45:20 2015-08-05 18:45:20 1 0 secs
# 3 3 1 2015-08-04 09:54:33 2015-08-04 09:54:39 2 6 secs
# 4 3 2 2015-08-04 09:57:31 2015-08-04 09:58:39 5 68 secs