Subset records between timestamps
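I have two data frames: trips, which holds the unique trips made by bikes with a unique id, and intervals, which shows the position of each bike ID every 10 minutes. My objective is to remove records from intervals if time is between start and finish and the bike_ids are the same. The times are of class POSIXct, and the original data frames have hundreds of thousands of records.
For example, the result for the two datasets below should be: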
> trips
bike_id start finish
1 1 2017-11-22 15:52:36 2017-11-22 17:47:53
2 2 2017-11-22 16:05:44 2017-11-22 16:23:25
3 3 2017-11-22 16:31:06 2017-11-22 17:11:20
> intervals
time bike_id
3 2017-11-22 16:00:03 1
4 2017-11-22 16:10:03 1
5 2017-11-22 16:20:02 1
6 2017-11-22 16:30:02 1
7 2017-11-22 16:40:03 1
8 2017-11-22 16:50:02 1
9 2017-11-22 17:00:02 1
10 2017-11-22 17:10:02 1
11 2017-11-22 17:20:03 1
12 2017-11-22 17:30:03 1
13 2017-11-22 16:00:03 2
14 2017-11-22 16:10:03 2
15 2017-11-22 16:20:02 2
16 2017-11-22 16:30:02 2
17 2017-11-22 16:40:03 2
18 2017-11-22 16:50:02 2
19 2017-11-22 17:00:02 2
20 2017-11-22 17:10:02 2
21 2017-11-22 17:20:03 2
22 2017-11-22 17:30:03 2
23 2017-11-22 16:30:02 3
24 2017-11-22 16:40:03 3
25 2017-11-22 16:50:02 3
26 2017-11-22 17:00:02 3
27 2017-11-22 17:10:02 3
28 2017-11-22 17:20:03 3
29 2017-11-22 17:30:03 3
Outcome:
> outcome
time bike_id
13 2017-11-22 16:00:03 2
16 2017-11-22 16:30:02 2
17 2017-11-22 16:40:03 2
18 2017-11-22 16:50:02 2
19 2017-11-22 17:00:02 2
20 2017-11-22 17:10:02 2
21 2017-11-22 17:20:03 2
22 2017-11-22 17:30:03 2
23 2017-11-22 16:30:02 3
28 2017-11-22 17:20:03 3
29 2017-11-22 17:30:03 3
I am not sure where to begin. Any suggestions on where to start with dplyr or the apply functions would be greatly appreciated!
Sample data below:
> dput(intervals)
structure(list(time = structure(c(1511384403.94561, 1511385003.17654,
1511385602.47887, 1511386202.99895, 1511386803.18361, 1511387402.98233,
1511388002.69461, 1511388602.5818, 1511389203.52712, 1511389803.652,
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895,
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818,
1511389203.52712, 1511389803.652, 1511386202.99895, 1511386803.18361,
1511387402.98233, 1511388002.69461, 1511388602.5818, 1511389203.52712,
1511389803.652), class = c("POSIXct", "POSIXt"), tzone = ""),
bike_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"bike_id"), row.names = 3:29, class = "data.frame")
> dput(trips)
structure(list(bike_id = c(1, 2, 3), start = structure(c(1511383956,
1511384744, 1511386266), class = c("POSIXct", "POSIXt"), tzone = ""),
finish = structure(c(1511390873, 1511385805, 1511388680), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("bike_id", "start", "finish"
), row.names = c(NA, 3L), class = "data.frame")
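I am new to the data.table package, so please test the following approach carefully.
I chose data.table over dplyr because this task requires a join by range, which dplyr could not do at the time of writing. Here is a solution using the foverlaps function.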
library(data.table)
# Convert the data frame to data.table
setDT(intervals)
setDT(trips)
# Create a second column time2, which is the same as time
intervals[, time2 := time]
# Set keys in trips
setkey(trips, bike_id, start, finish)
# Conduct join by bike_id and time
# The two columns in intervals used for the time join
# need to be the last two entries in by.x
intervals2 <- foverlaps(intervals, trips,
by.x = c("bike_id", "time", "time2"))
# Keep the rows with NA in start, which means no match,
# and then select the time and bike_id columns
outcome <- intervals2[is.na(start)][, .(time, bike_id)]
outcome
# time bike_id
# 1: 2017-11-22 16:00:03 2
# 2: 2017-11-22 16:30:02 2
# 3: 2017-11-22 16:40:03 2
# 4: 2017-11-22 16:50:02 2
# 5: 2017-11-22 17:00:02 2
# 6: 2017-11-22 17:10:02 2
# 7: 2017-11-22 17:20:03 2
# 8: 2017-11-22 17:30:03 2
# 9: 2017-11-22 16:30:02 3
# 10: 2017-11-22 17:20:03 3
# 11: 2017-11-22 17:30:03 3
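For readers on a newer dplyr, the same range-based anti-join can be sketched without data.table. This is only a sketch and not part of the answer above: it assumes dplyr 1.1.0 or later, which provides join_by() with a between() helper, and it rebuilds the sample data from the epoch values in the question (tz = "UTC" is chosen here only to make the example reproducible).

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which provides join_by()

# Sample data rebuilt from the epoch seconds in the question's dput() output
times <- c(1511384403.94561, 1511385003.17654, 1511385602.47887,
           1511386202.99895, 1511386803.18361, 1511387402.98233,
           1511388002.69461, 1511388602.5818, 1511389203.52712,
           1511389803.652)
intervals <- data.frame(
  time = as.POSIXct(c(times, times, times[4:10]),
                    origin = "1970-01-01", tz = "UTC"),
  bike_id = rep(1:3, times = c(10, 10, 7))
)
trips <- data.frame(
  bike_id = 1:3,
  start  = as.POSIXct(c(1511383956, 1511384744, 1511386266),
                      origin = "1970-01-01", tz = "UTC"),
  finish = as.POSIXct(c(1511390873, 1511385805, 1511388680),
                      origin = "1970-01-01", tz = "UTC")
)

# Keep only the interval rows that do NOT fall inside a trip of the same bike
# (the bounds of between() are inclusive by default)
outcome <- anti_join(intervals, trips,
                     by = join_by(bike_id, between(time, start, finish)))
nrow(outcome)
```

Because anti_join() only filters intervals, no helper column such as time2 is needed and the columns of trips never appear in the result.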
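Here is my answer. trips is the reference dataset. matched() is a function that looks up the start and finish of each bike in trips and copies them onto the rows of intervals.
Answer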
library(dplyr)  # provides %>%, mutate(), filter() and select()
trips <- data.frame(bike_id = 1:3,
start = as.POSIXct(c("2017-11-22 15:52:36", "2017-11-22 16:05:44", "2017-11-22 16:31:06")),
finish = as.POSIXct(c("2017-11-22 17:47:53","2017-11-22 16:23:25","2017-11-22 17:11:20")))%>%
mutate(start = as.numeric(start),
finish = as.numeric(finish))
matched <- function(var1, var2, df1, df2){
return(df2[,var1][match(df1[,var2],df2[,var2])])
}
intervals%>%
mutate(time_num = as.numeric(time),
start = matched("start", "bike_id", intervals , trips),
finish = matched("finish", "bike_id", intervals , trips))%>%
filter(time_num < start | time_num > finish)%>%
select(time, bike_id)
time bike_id
1 2017-11-22 16:00:03 2
2 2017-11-22 16:30:02 2
3 2017-11-22 16:40:03 2
4 2017-11-22 16:50:02 2
5 2017-11-22 17:00:02 2
6 2017-11-22 17:10:02 2
7 2017-11-22 17:20:03 2
8 2017-11-22 17:30:03 2
9 2017-11-22 16:30:02 3
10 2017-11-22 17:20:03 3
11 2017-11-22 17:30:03 3
For some strange reason, I could not get between() to work. I will look into it later.
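The answer does not say what went wrong with between(). One possible explanation, stated here as an assumption, is that between() in the dplyr versions of that time expected length-one bounds rather than per-row columns. The sketch below sidesteps the issue with explicit comparisons and keeps the times as POSIXct instead of converting them to numbers. It uses left_join() in place of the matched() helper and, like the answer above, assumes at most one trip per bike; the sample data is rebuilt from the epoch values in the question with tz = "UTC" for reproducibility.

```r
library(dplyr)

# Sample data rebuilt from the epoch seconds in the question's dput() output
times <- c(1511384403.94561, 1511385003.17654, 1511385602.47887,
           1511386202.99895, 1511386803.18361, 1511387402.98233,
           1511388002.69461, 1511388602.5818, 1511389203.52712,
           1511389803.652)
intervals <- data.frame(
  time = as.POSIXct(c(times, times, times[4:10]),
                    origin = "1970-01-01", tz = "UTC"),
  bike_id = rep(1:3, times = c(10, 10, 7))
)
trips <- data.frame(
  bike_id = 1:3,
  start  = as.POSIXct(c(1511383956, 1511384744, 1511386266),
                      origin = "1970-01-01", tz = "UTC"),
  finish = as.POSIXct(c(1511390873, 1511385805, 1511388680),
                      origin = "1970-01-01", tz = "UTC")
)

outcome <- intervals %>%
  # Attach each bike's trip window (assumes at most one trip per bike)
  left_join(trips, by = "bike_id") %>%
  # Keep rows outside the window, and rows for bikes without any trip
  filter(is.na(start) | time < start | time > finish) %>%
  select(time, bike_id)
nrow(outcome)
```

If a bike can have several trips, the left join produces one row per interval and trip, so this filter is no longer sufficient and a range-based anti-join is the safer choice.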
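This can be solved with a kind of non-equi anti-join.
Non-equi joins have been available in data.table since version 1.9.8 (on CRAN 25 Nov 2016) and can be used as a convenient replacement for foverlaps() in many cases. In particular, foverlaps() requires the second argument to be keyed, whereas a non-equi join works equally well with unkeyed and keyed data.tables.
First, the non-equi join is used to identify the indices of the rows of intervals whose time lies between the start and finish times in trips. Then, these rows are removed from intervals.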
library(data.table)
tmp <- setDT(intervals)[setDT(trips), on = .(bike_id, time >= start, time <= finish),
which = TRUE]
intervals[!tmp]
time bike_id
1: 2017-11-22 16:00:03 2
2: 2017-11-22 16:30:02 2
3: 2017-11-22 16:40:03 2
4: 2017-11-22 16:50:02 2
5: 2017-11-22 17:00:02 2
6: 2017-11-22 17:10:02 2
7: 2017-11-22 17:20:03 2
8: 2017-11-22 17:30:03 2
9: 2017-11-22 16:30:02 3
10: 2017-11-22 17:20:03 3
11: 2017-11-22 17:30:03 3
tmp contains the indices of the rows to be removed:
tmp
[1] 1 2 3 4 5 6 7 8 9 10 12 13 22 23 24 25
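With the default nomatch = NA, which = TRUE returns NA for any row of trips that matches no row of intervals. That does not happen in the sample data, but it may in larger data. The variant below is a sketch, not the answer's own code: it drops such non-matches with nomatch = NULL (which assumes data.table 1.12.0 or later) and builds an explicit logical keep vector, so the result does not depend on how NA or empty index vectors are treated. The names hit and keep are introduced here for illustration only.

```r
library(data.table)  # nomatch = NULL assumes data.table >= 1.12.0

# Sample data rebuilt from the epoch seconds in the question's dput() output
times <- c(1511384403.94561, 1511385003.17654, 1511385602.47887,
           1511386202.99895, 1511386803.18361, 1511387402.98233,
           1511388002.69461, 1511388602.5818, 1511389203.52712,
           1511389803.652)
intervals <- data.table(
  time = as.POSIXct(c(times, times, times[4:10]),
                    origin = "1970-01-01", tz = "UTC"),
  bike_id = rep(1:3, times = c(10, 10, 7))
)
trips <- data.table(
  bike_id = 1:3,
  start  = as.POSIXct(c(1511383956, 1511384744, 1511386266),
                      origin = "1970-01-01", tz = "UTC"),
  finish = as.POSIXct(c(1511390873, 1511385805, 1511388680),
                      origin = "1970-01-01", tz = "UTC")
)

# Row numbers of intervals that fall inside a trip of the same bike;
# nomatch = NULL leaves out trips that match no interval at all
hit <- intervals[trips, on = .(bike_id, time >= start, time <= finish),
                 which = TRUE, nomatch = NULL]

# Logical mask of the rows to keep, computed outside the data.table call
keep <- !(seq_len(nrow(intervals)) %in% hit)
outcome <- intervals[keep]
nrow(outcome)
```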