How to calculate the number of events during a specific time period
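I am trying to count the number of events in "df2" (each row is an event) that fall within the time periods defined in "df1". I can do this for each whole period of roughly 5 minutes, but I would like to split each period into smaller chunks (1 minute) and do the same calculation.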
df1<- structure(list(Location = 1:10, Lattitude = c(57.140532, 57.140527,
57.13959, 57.13974, 57.14059, 57.14058, 57.1398, 57.13989, 57.14158,
57.14386), t_in = structure(c(1455626730, 1455627326, 1455628122,
1455628644, 1455629174, 1455629708, 1455630230, 1455630765, 1455631396,
1455631931), class = c("POSIXct", "POSIXt"), tzone = ""), t_out = structure(c(1455627047,
1455627615, 1455628462, 1455628933, 1455629486, 1455630015, 1455630552,
1455631070, 1455631719, 1455632242), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("Location", "Lattitude", "t_in",
"t_out"), class = "data.frame", row.names = c(NA, -10L))
df2<- structure(list(date.time = structure(c(1455630964, 1455630976,
1455630987, 1455630998, 1455631009, 1455631021, 1455631032, 1455631043,
1455631054, 1455631066, 1455631077, 1455631088, 1455631099, 1455631111,
1455631423, 1455631446, 1455631479, 1455631502, 1455631569, 1455631772
), class = c("POSIXct", "POSIXt"), tzone = ""), code = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("1003", "32221"), class = "factor"),
rec_id = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("301976",
"301978", "301985", "301988"), class = "factor"), Lattitude = c("57.14066",
"57.14066", "57.14066", "57.14066", "57.14066", "57.14066",
"57.14066", "57.14066", "57.14066", "57.14066", "57.14066",
"57.14066", "57.14066", "57.14066", "57.141869", "57.141869",
"57.141869", "57.141869", "57.141869", "57.141869"), Longitude = c("2.075702",
"2.075702", "2.075702", "2.075702", "2.075702", "2.075702",
"2.075702", "2.075702", "2.075702", "2.075702", "2.075702",
"2.075702", "2.075702", "2.075702", "2.081576", "2.081576",
"2.081576", "2.081576", "2.081576", "2.081576"), Location = list(
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, NA, NA, NA, NA, 9, 9, 9,
9, 9, NA)), .Names = c("date.time", "code", "rec_id",
"Lattitude", "Longitude", "Location"), row.names = 94:113, class = "data.frame")
The function below returns the Location from df1 if date.time in df2 falls between df1$t_in and df1$t_out. It may seem a roundabout way of doing it, but the result is used in later calculations.
ids <- as.numeric(df1$Location)
f <- function(x){
a <- ids[ (df1$t_in < x) & (x < df1$t_out) ]
if (length(a) == 0) NA else a
}
df2$Location <- lapply(df2$date.time, f)
The above returns a list, so it needs to be converted to numeric. It is a bit clumsy, but I could not find a way around it.
df2$Location<- paste(df2$Location)
df2$Location<- as.numeric(df2$Location)
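As an aside, the list-to-character-to-numeric round trip can probably be avoided by using vapply(), which returns a plain vector directly. The following is a minimal sketch on toy data; the names visits, events and find_location are hypothetical and not part of the original code, and it assumes each event falls inside at most one window (only the first match is kept).

```r
# Toy data (hypothetical values, not the data from the question)
visits <- data.frame(
  Location = c(1, 2),
  t_in  = as.POSIXct(c("2016-02-16 14:00:00", "2016-02-16 14:10:00"), tz = "UTC"),
  t_out = as.POSIXct(c("2016-02-16 14:05:00", "2016-02-16 14:15:00"), tz = "UTC")
)
events <- as.POSIXct(c("2016-02-16 14:01:30", "2016-02-16 14:07:00",
                       "2016-02-16 14:12:00"), tz = "UTC")

# Return the Location whose (t_in, t_out) window contains x, or NA if none does.
find_location <- function(x) {
  hit <- visits$Location[visits$t_in < x & x < visits$t_out]
  if (length(hit) == 0) NA_real_ else hit[1]
}

# vapply() yields a numeric vector, so no paste()/as.numeric() step is needed.
loc <- vapply(seq_along(events), function(i) find_location(events[i]), numeric(1))
```

Here the second event lies between the two windows, so its entry in loc is NA.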
Then remove the NAs, because those events fall outside the time periods defined in df1 and are therefore not relevant.
df2<-df2[!is.na(df2$Location),]
Then count the number of events (i.e. rows) for each rec_id and Location
library (plyr)
df3 <- ddply(df2, c("rec_id","Location"), function(df){data.frame (detections=nrow(df))})
rec_id Location detections
1 301976 9 5
2 301978 8 10
...perfect!
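For reference, the same per-group count can likely be obtained without plyr by using aggregate() from base R. This is only a sketch on toy data; the data frame det and its values are hypothetical.

```r
# Toy detections (hypothetical values) that already carry a Location
det <- data.frame(
  rec_id   = c("301978", "301978", "301978", "301976", "301976"),
  Location = c(8, 8, 8, 9, 9)
)

# Each row is one detection, so summing a column of ones counts rows per group.
det$detections <- 1L
counts <- aggregate(detections ~ rec_id + Location, data = det, FUN = sum)
```

The result has one row per rec_id/Location combination that actually occurs, with the row count in the detections column.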
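But I want to do this over shorter time periods, every minute to be exact. The periods should start at each location's t_in (df1) and run until its t_out (df1). I could do this with a lot of manual work in Excel, but surely it can be automated in R (it is a large dataset).
So ultimately I want the number of events (nrow) per location for each 1-minute period between t_in and t_out in df1.
For example (a visual example only, not actual data):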
rec_id Location minute(or period) detections
301976 9 1 1
301976 9 2 2
301976 9 3 0
301976 9 4 0
301976 9 5 2
301978 8 1 4
301978 8 2 3
301978 8 3 1
301978 8 4 0
301978 8 5 2
I can create the intervals for the first location, but I am not sure how to take it further
seq(from = head(df1$t_in,1), to = head(df1$t_out,1) , by = "mins")
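I think the following can be used to generate a new df1 data frame holding the split sequences; you can then apply your steps above to the new df1.
The pieces could be combined, but I just want to make sure this really does what you need.
First, we expand the time intervals in the original data frame and generate a list of expanded periods. Each row of df1 becomes one element of the list.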
# lapply() always returns a list; sapply() may simplify equal-length sequences to a matrix
res1 <- lapply(seq_len(nrow(df1)), function(i) {
seq(from = df1$t_in[i], to = df1$t_out[i] , by = "mins")})
Then we convert each sequence in the list into a two-column data frame
res2 <- lapply(res1, function(x) {
data.frame(t_in = head(x, -1), t_out = tail(x, -1)) })
Finally, we merge everything together
df1v2 <- Reduce(function(...) merge(..., all = TRUE), res2)
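The merged table above only keeps t_in and t_out, so the original Location and the minute number within each visit are lost. A possible variant that carries both along is sketched below on toy data; the names visits, split_visit and intervals are hypothetical, and like the seq() call above it drops any final partial minute of a visit.

```r
# Toy visits (hypothetical values): one row per Location with entry/exit times
visits <- data.frame(
  Location = c(1, 2),
  t_in  = as.POSIXct(c("2016-02-16 14:00:00", "2016-02-16 14:10:00"), tz = "UTC"),
  t_out = as.POSIXct(c("2016-02-16 14:03:10", "2016-02-16 14:12:00"), tz = "UTC")
)

# Split visit i into consecutive 1-minute intervals, keeping its Location
# and numbering the intervals 1, 2, ... within the visit.
split_visit <- function(i) {
  breaks <- seq(from = visits$t_in[i], to = visits$t_out[i], by = "1 min")
  n <- length(breaks) - 1            # number of whole minutes in this visit
  data.frame(Location = rep(visits$Location[i], n),
             period   = seq_len(n),
             t_in     = breaks[seq_len(n)],
             t_out    = breaks[seq_len(n) + 1])
}

# Stack the per-visit tables; rbind keeps the rows in visit order.
intervals <- do.call(rbind, lapply(seq_len(nrow(visits)), split_visit))
```

With these toy values the first visit yields three intervals (its last 10 seconds are dropped) and the second yields two.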
Then (adapting your code)
ids <- seq_len(nrow(df1v2))
f <- function(x){
a <- ids[ (df1v2$t_in < x) & (x < df1v2$t_out) ]
if (length(a) == 0) NA else a
}
df2$Location <- lapply(df2$date.time, f)
which produces
date.time code rec_id Lattitude Longitude Location
94 2016-02-16 14:56:04 32221 301978 57.14066 2.075702 37
95 2016-02-16 14:56:16 32221 301978 57.14066 2.075702 37
96 2016-02-16 14:56:27 32221 301978 57.14066 2.075702 37
97 2016-02-16 14:56:38 32221 301978 57.14066 2.075702 37
98 2016-02-16 14:56:49 32221 301978 57.14066 2.075702 38
99 2016-02-16 14:57:01 32221 301978 57.14066 2.075702 38
100 2016-02-16 14:57:12 32221 301978 57.14066 2.075702 38
101 2016-02-16 14:57:23 32221 301978 57.14066 2.075702 38
102 2016-02-16 14:57:34 32221 301978 57.14066 2.075702 38
103 2016-02-16 14:57:46 32221 301978 57.14066 2.075702 NA
104 2016-02-16 14:57:57 32221 301978 57.14066 2.075702 NA
105 2016-02-16 14:58:08 32221 301978 57.14066 2.075702 NA
106 2016-02-16 14:58:19 32221 301978 57.14066 2.075702 NA
107 2016-02-16 14:58:31 32221 301978 57.14066 2.075702 NA
108 2016-02-16 15:03:43 32221 301976 57.141869 2.081576 39
109 2016-02-16 15:04:06 32221 301976 57.141869 2.081576 39
110 2016-02-16 15:04:39 32221 301976 57.141869 2.081576 40
111 2016-02-16 15:05:02 32221 301976 57.141869 2.081576 40
112 2016-02-16 15:06:09 32221 301976 57.141869 2.081576 41
113 2016-02-16 15:09:32 32221 301976 57.141869 2.081576 NA
I am not sure the boundary checks are right (adjust f if needed), but it looks like you get the idea. How important is speed?
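If speed does matter, one option is to skip the per-event lookup and compute the minute index arithmetically from each event's offset to its visit's t_in. The sketch below uses toy data and assumes every event already carries the Location of its visit, as in the question's first step. The names visits, events, m, n_min and counts are hypothetical. Note that it uses half-open minutes (an event exactly on a minute boundary goes to the later minute), which differs from the strict inequalities in f, and that events in a final partial minute are not counted.

```r
# Toy data (hypothetical values)
visits <- data.frame(
  Location = c(8, 9),
  t_in  = as.POSIXct(c("2016-02-16 14:00:00", "2016-02-16 14:10:00"), tz = "UTC"),
  t_out = as.POSIXct(c("2016-02-16 14:03:00", "2016-02-16 14:12:00"), tz = "UTC")
)
events <- data.frame(
  Location  = c(8, 8, 8, 9),
  date.time = as.POSIXct(c("2016-02-16 14:00:10", "2016-02-16 14:00:50",
                           "2016-02-16 14:02:30", "2016-02-16 14:11:05"), tz = "UTC")
)

# Minute index within the visit: 1 for the first 60 s, 2 for the next 60 s, ...
m <- merge(events, visits, by = "Location")
m$period <- floor(as.numeric(difftime(m$date.time, m$t_in, units = "secs")) / 60) + 1

# Whole minutes per visit; fixed factor levels keep minutes with zero events.
n_min <- floor(as.numeric(difftime(visits$t_out, visits$t_in, units = "secs")) / 60)
counts <- do.call(rbind, lapply(seq_len(nrow(visits)), function(i) {
  p <- m$period[m$Location == visits$Location[i]]
  tab <- table(factor(p, levels = seq_len(n_min[i])))
  data.frame(Location   = visits$Location[i],
             period     = seq_len(n_min[i]),
             detections = as.integer(tab))
}))
```

The result has one row per Location and minute, including minutes with zero detections, which is the shape of the example table in the question (without rec_id).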