Compare date intervals within the same data frame
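I searched around and found similar questions, but couldn't get the answers to work for my data.
I have a data frame with start and end dates plus several other factors. Ideally, each row's start date would be later than the end date of every previous row, but the data contains duplicated starts or ends, and sometimes the date intervals overlap.
I tried to make a reproducible example: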
df = data.frame(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00","2018/04/16 15:30:00",
"2018/04/17 12:40:00","2018/04/17 18:50:00"),
end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
"2018/04/17 16:40:00","2018/04/17 19:50:00"),
value=c(10,15,11,13,14,12))
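I was able to remove the duplicates (of end or start dates), but I can't remove the overlapping intervals. I would like to create a loop that "cleans" out every interval that is contained within a larger interval, so that the result looks like this: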
result = df[c(1,3,6),]
I thought I could write a loop to "clean" both the duplicated and the overlapping intervals, but I couldn't get it to work.
Any suggestions?
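The data.table package is well suited to this: use its overlap-join function foverlaps (inspired by the findOverlaps function in the Bioconductor package IRanges) to find the intervals that lie within another interval, then an anti-join (the data.table syntax is B[!A, on]) to remove those inner intervals.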
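library(data.table)

cols <- c("start", "end")
setDT(df)
# parse the character columns as date-times
df[, (cols) := lapply(.SD, function(x) as.POSIXct(x, format="%Y/%m/%d %H:%M:%S")), .SDcols=cols]
setkeyv(df, cols)
# self-join: pair each interval (i.* columns) with every interval that contains it,
# then drop the trivial match of a row with itself
anti <- foverlaps(df, df, type="within")[start!=i.start | end!=i.end | value!=i.value]
# anti-join: remove the rows that were found inside another interval
df[!anti, on=.(start=i.start, end=i.end, value=i.value)]
#                  start                 end value
# 1: 2018-04-15 09:00:00 2018-04-16 08:00:00    10
# 2: 2018-04-16 10:20:00 2018-04-17 18:20:00    11
# 3: 2018-04-17 18:50:00 2018-04-17 19:50:00    12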
Another approach is to use %within% from the lubridate package:
library(lubridate)
# transform characters to dates
start_time <- as_datetime(df[ , "start"], tz = "UTC")
end_time <- as_datetime(df[ , "end"], tz = "UTC")
# construct intervals
start_end_intrvls <- interval(start_time, end_time)
# find indices of the non-within intervals
not_within <- !(sapply(FUN = function(i) any(start_end_intrvls[i] %within% start_end_intrvls[-i]),
X = seq(along.with = df[ , "start"])))
df[not_within, ]
#                 start                 end value
# 1  2018/04/15 9:00:00  2018/04/16 8:00:00    10
# 3 2018/04/16 10:20:00 2018/04/17 18:20:00    11
# 6 2018/04/17 18:50:00 2018/04/17 19:50:00    12
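The loop above tests each interval against all the others. As a rough illustration of the same containment rule without any packages, here is a minimal base R sketch that builds the whole pairwise comparison at once with outer(). The names fmt, s, e, inside, identical_iv and contained are made up for this example. The tie-break for exact duplicates (keep the first copy) is an assumption of mine and not something the answers specify; under an inclusive test, two intervals with identical start and end each count as within the other.

```r
# Example data from the question (strings kept as characters).
df <- data.frame(
  start = c("2018/04/15 9:00:00", "2018/04/15 9:00:00", "2018/04/16 10:20:00",
            "2018/04/16 15:30:00", "2018/04/17 12:40:00", "2018/04/17 18:50:00"),
  end   = c("2018/04/16 8:00:00", "2018/04/16 7:10:00", "2018/04/17 18:20:00",
            "2018/04/16 16:30:00", "2018/04/17 16:40:00", "2018/04/17 19:50:00"),
  value = c(10, 15, 11, 13, 14, 12),
  stringsAsFactors = FALSE
)

# Parse with an explicit format and compare as plain numbers (seconds).
fmt <- "%Y/%m/%d %H:%M:%S"
s <- as.numeric(as.POSIXct(df$start, format = fmt, tz = "UTC"))
e <- as.numeric(as.POSIXct(df$end,   format = fmt, tz = "UTC"))

# inside[i, j] is TRUE when interval i lies within interval j (inclusive).
inside <- outer(s, s, ">=") & outer(e, e, "<=")
diag(inside) <- FALSE  # every interval trivially contains itself

# Assumed tie-break: among exact duplicates, only later copies count as contained.
identical_iv <- outer(s, s, "==") & outer(e, e, "==")
inside[identical_iv & upper.tri(inside)] <- FALSE

# A row is dropped if it lies within at least one other row's interval.
contained <- rowSums(inside) > 0
result <- df[!contained, ]
result
```

On the example data this keeps rows 1, 3 and 6, matching the result asked for in the question. The matrix has one cell per pair of rows, so this is only sensible for data frames of moderate size.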
Update
The as_datetime() function throws an error when it is applied to a tibble:
as_datetime(tibble("2018/04/15 9:00:00"), tz = "UTC")
Error in as.POSIXct.default(x) :
do not know how to convert 'x' to class “POSIXct”
The solution above can be adapted to work around this by replacing as_datetime() with as.POSIXlt():
library(tibble)
library(lubridate)

df_tibble <- tibble(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00",
                            "2018/04/16 15:30:00", "2018/04/17 12:40:00","2018/04/17 18:50:00"),
                    end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
                          "2018/04/17 16:40:00","2018/04/17 19:50:00"), value=c(10,15,11,13,14,12))
# convert row by row: as.character() on a single tibble cell yields the bare string
start_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[i , "start"]),
                                                      tz = "UTC"),
                         X = seq(along.with = unlist(df_tibble[ , "start"])))
end_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[ i, "end"]),
                                                    tz = "UTC"),
                       X = seq(along.with = unlist(df_tibble[ , "end"])))
# one interval per row
start_end_intrvls <- lapply(function(i) interval(start_time_lst[[i]] , end_time_lst[[i]]),
                            X = seq(along.with = unlist(df_tibble[ , "start"])))
# TRUE for rows whose interval is not within any other row's interval
not_within <- sapply(function(i) !(any(unlist(Map(`%within%`,
                                                  start_end_intrvls[[i]], start_end_intrvls[-i])))),
                     X = seq(along.with = unlist(df_tibble[ , "start"])))
# keep only those rows, as in the data frame version
df_tibble[not_within, ]
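A possible simplification, offered as a sketch rather than as part of the original answer: the error arises because `df_tibble[ , "start"]` returns a one-column tibble instead of a vector, so extracting the column with `[[` gives a plain character vector that can be converted in one vectorized call. The example below swaps lubridate's interval() and %within% for base R as.POSIXct() and direct comparisons; fmt, start_time, end_time and kept are names chosen for this illustration.

```r
library(tibble)

df_tibble <- tibble(
  start = c("2018/04/15 9:00:00", "2018/04/15 9:00:00", "2018/04/16 10:20:00",
            "2018/04/16 15:30:00", "2018/04/17 12:40:00", "2018/04/17 18:50:00"),
  end   = c("2018/04/16 8:00:00", "2018/04/16 7:10:00", "2018/04/17 18:20:00",
            "2018/04/16 16:30:00", "2018/04/17 16:40:00", "2018/04/17 19:50:00"),
  value = c(10, 15, 11, 13, 14, 12)
)

# `[[` returns the column as a character vector, whereas
# df_tibble[ , "start"] would return a one-column tibble.
fmt <- "%Y/%m/%d %H:%M:%S"
start_time <- as.POSIXct(df_tibble[["start"]], format = fmt, tz = "UTC")
end_time   <- as.POSIXct(df_tibble[["end"]],   format = fmt, tz = "UTC")

# For each row, check whether its interval lies within any other row's interval.
not_within <- vapply(seq_along(start_time), function(i) {
  !any(start_time[i] >= start_time[-i] & end_time[i] <= end_time[-i])
}, logical(1))

kept <- df_tibble[not_within, ]
kept
```

As with the %within% loop, the comparison is inclusive, so two rows with exactly the same start and end would both be dropped; the example data has no such pair.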