R: Find overlap among time periods
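After a lot of thinking and googling I could not find a solution to my problem, so I hope you can help.
I have a large data frame with an ID column whose values can repeat more than twice, and a start-date and an end-date column that together define a time period. I would like to group by ID, find out whether any time period for that ID overlaps with another one, and if so flag it by creating a new column, for example one stating whether that ID has overlaps.
Here is an example data frame that already contains the desired new column: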
structure(list(ID= c(34L, 34L, 80L, 80L, 81L, 81L, 81L, 94L,
94L), Start = structure(c(1072911600, 1262300400, 1157061600,
1277935200, 1157061600, 1277935200, 1157061600, 1075590000, 1285891200
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1262214000,
1409436000, 1251669600, 1404079200, 1251669600, 1404079200, 1251669600,
1264892400, 1475193600), class = c("POSIXct", "POSIXt"), tzone = ""),
Overlap = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE,
FALSE, FALSE)), .Names = c("ID", "Start", "End", "Overlap"
), row.names = c(NA, -9L), class = "data.frame")
ID Start End Overlap
34 2004-01-01 00:00:00 2009-12-31 00:00:00 FALSE
34 2010-01-01 00:00:00 2014-08-31 00:00:00 FALSE
80 2006-09-01 00:00:00 2009-08-31 00:00:00 FALSE
80 2010-07-01 00:00:00 2014-06-30 00:00:00 FALSE
81 2006-09-01 00:00:00 2009-08-31 00:00:00 TRUE
81 2010-07-01 00:00:00 2014-06-30 00:00:00 TRUE
81 2006-09-01 00:00:00 2009-08-31 00:00:00 TRUE
94 2004-02-01 00:00:00 2010-01-31 00:00:00 FALSE
94 2010-10-01 02:00:00 2016-09-30 02:00:00 FALSE
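In this case, for ID 81, two of the time periods overlap, so I would like to mark every row with ID = 81 as TRUE, meaning that at least two overlapping rows were found for that ID. This is just the ideal solution; in general, all I want is to find overlaps when grouping by ID, so the way they are flagged can be flexible if that simplifies things.
Thanks in advance for any help.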
I think this is the code you are looking for? Let me know.
data<- structure(list(ID= c(34L, 34L, 80L, 80L, 81L, 81L, 81L, 94L,
94L), Start = structure(c(1072911600, 1262300400, 1157061600,
1277935200, 1157061600, 1277935200, 1157061600, 1075590000, 1285891200
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1262214000,
1409436000, 1251669600, 1404079200, 1251669600, 1404079200, 1251669600,
1264892400, 1475193600), class = c("POSIXct", "POSIXt"), tzone = ""),
Overlap = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE,
FALSE, FALSE)), .Names = c("ID", "Start", "End", "Overlap"
), row.names = c(NA, -9L), class = "data.frame")
library("dplyr")
library("lubridate")

# Return TRUE if any two intervals in the vector overlap
overlaps <- function(intervals) {
  n <- length(intervals)
  if (n < 2) return(FALSE)  # a single interval has nothing to overlap with
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (int_overlaps(intervals[i], intervals[j])) {
        return(TRUE)
      }
    }
  }
  return(FALSE)
}

data %>%
  mutate(Interval = interval(Start, End)) %>%
  group_by(ID) %>%
  do({
    df <- .
    ovl <- overlaps(df$Interval)
    # One summary row per ID
    data.frame(ID = df$ID[1], ovl)
  })
Also, I hope someone can suggest a more elegant solution for my overlaps function.
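One possible way to avoid the nested loops, sketched here in base R: sort the periods by start time and compare each start with the latest end seen so far. The helper name any_overlap and the small data frame (a subset of the question's sample, stored as Date for brevity) are assumptions made for illustration and are not part of the original answer. Periods that merely touch at an endpoint are counted as overlapping in this sketch.

```r
# Hypothetical helper (not from the original answer):
# TRUE if any two periods in the group overlap.
any_overlap <- function(start, end) {
  n <- length(start)
  if (n < 2) return(FALSE)       # one period has nothing to overlap with
  o <- order(start)              # sort periods by start time
  s <- as.numeric(start[o])
  e <- as.numeric(end[o])
  latest_end <- cummax(e)[-n]    # latest end among all earlier periods
  any(s[-1] <= latest_end)       # use < to treat touching periods as separate
}

# Subset of the question's sample data, stored as Date for brevity
df <- data.frame(
  ID    = c(34L, 34L, 81L, 81L, 81L),
  Start = as.Date(c("2004-01-01", "2010-01-01", "2006-09-01",
                    "2010-07-01", "2006-09-01")),
  End   = as.Date(c("2009-12-31", "2014-08-31", "2009-08-31",
                    "2014-06-30", "2009-08-31"))
)

# One flag per ID, then copied onto every row of that ID
flag <- vapply(split(seq_len(nrow(df)), df$ID),
               function(idx) any_overlap(df$Start[idx], df$End[idx]),
               logical(1))
df$Overlap <- unname(flag[as.character(df$ID)])
df
```

Because the periods are sorted first, each group costs one sort plus a single pass instead of a comparison of every pair, which may matter on a large data frame.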
Another option, assuming df contains your data frame:
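library(data.table)
# Key on the interval columns (needed by foverlaps), drop the existing Overlap column, add a row index
dt <- data.table(df, key=c("Start", "End"))[, `:=`(Overlap=NULL, row=1:nrow(df))]
# Self-join on overlapping intervals; keep IDs where two different rows of the same ID overlap
overlapping <- unique(foverlaps(dt, dt)[ID==i.ID & row!=i.row, ID])
# Flag every row of those IDs and remove the helper column
dt[, `:=`(Overlap=FALSE, row=NULL)][ID %in% overlapping, Overlap:=TRUE][order(ID, Start)]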
# ID Start End Overlap
# 1: 34 2004-01-01 00:00:00 2009-12-31 00:00:00 FALSE
# 2: 34 2010-01-01 00:00:00 2014-08-31 00:00:00 FALSE
# 3: 80 2006-09-01 00:00:00 2009-08-31 00:00:00 FALSE
# 4: 80 2010-07-01 00:00:00 2014-06-30 00:00:00 FALSE
# 5: 81 2006-09-01 00:00:00 2009-08-31 00:00:00 TRUE
# 6: 81 2006-09-01 00:00:00 2009-08-31 00:00:00 TRUE
# 7: 81 2010-07-01 00:00:00 2014-06-30 00:00:00 TRUE
# 8: 94 2004-02-01 00:00:00 2010-01-31 00:00:00 FALSE
# 9: 94 2010-10-01 02:00:00 2016-09-30 02:00:00 FALSE
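To make the condition behind the self-join explicit, here is a rough base R sketch of the same idea using merge(): pair up every two rows that share an ID, then keep pairs of distinct rows where each period starts no later than the other one ends. The sample data is a subset of the question's, stored as Date for brevity, and the helper column row is an assumption for illustration. This is not a replacement for foverlaps; it builds every within-ID pair, so it is only meant for small groups.

```r
# Subset of the question's sample data, stored as Date for brevity
df <- data.frame(
  ID    = c(34L, 34L, 81L, 81L, 81L),
  Start = as.Date(c("2004-01-01", "2010-01-01", "2006-09-01",
                    "2010-07-01", "2006-09-01")),
  End   = as.Date(c("2009-12-31", "2014-08-31", "2009-08-31",
                    "2014-06-30", "2009-08-31"))
)
df$row <- seq_len(nrow(df))   # helper index to tell rows apart

# Self-join on ID: every pair of rows that share an ID
pairs <- merge(df, df, by = "ID", suffixes = c(".x", ".y"))

# Keep pairs of distinct rows whose periods intersect
hits <- pairs[pairs$row.x != pairs$row.y &
              pairs$Start.x <= pairs$End.y &
              pairs$Start.y <= pairs$End.x, ]

# Flag every row of an ID that has at least one overlapping pair
df$Overlap <- df$ID %in% hits$ID
df$row <- NULL
df
```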