How to compare each row of a data frame to each row of another data frame and calculate the overlap
I have two data frames containing start and end times. I want to compare each row of df2 with each row of df1 and calculate the overlap.
df1
# start end
#1 5 15
#2 20 28
#3 46 68
#4 80 87
df2
# start end
#1 20 40
#2 65 85
So the result should be a vector with one total per row of df2:
overlapping_duration_1 = 8 (overlap of df2 row 1 with df1 row 2)
overlapping_duration_2 = 3 + 5 = 8 (overlap of df2 row 2 with df1 row 3, plus overlap of df2 row 2 with df1 row 4)
I tried an ifelse approach that covers the different cases. It only works for the first row of df2.
overlap = ifelse ( df2$start <= df1$start & df1$start <= df2$end & df2$end <= df1$end, df2$end-df1$start, 0)
overlap2 = ifelse ( df2$start <= df1$start & df1$end <= df2$end, df1$end-df1$start, 0)
overlap3 = ifelse ( df1$start < df2$start & df2$end <= df1$end, df2$end-df2$start, 0)
overlap4 = ifelse ( df1$start < df2$start & df2$start <= df1$end & df1$end <= df2$end, df1$end-df2$start, 0)
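The different overlap vectors can then be combined, and the whole thing could be applied in a for loop over df2.
This approach is rather cumbersome. Is there a more convenient way?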
library(data.table)
df1 = data.table(start=c(5,20,46,80),end=c(15,28,68,87))
df2 = data.table(start=c(20,65), end=c(40,85))
# add row identifier (`rn`) and dummy variable (`id`) for the cartesian join
df1[, `:=`(id = 1, rn = .I)]
df2[, `:=`(id = 1, rn = .I)]
# cross join: every row of df2 is paired with every row of df1
df = df1[df2, on = "id", allow.cartesian = TRUE]
# compute the overlap per pair; pmin/pmax are vectorized, so no by-row grouping is needed
result = df[, overlap := pmin(i.end, end) - pmax(i.start, start)]
# retain positive overlaps, and sum by df2 row number
result[overlap > 0, .(total = sum(overlap)), by = .(rn = i.rn)]
Output:
rn total
1: 1 8
2: 2 8
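The single expression above works because, for any two intervals, the overlap is the earlier of the two ends minus the later of the two starts, and a negative value means there is no overlap. That covers all four ifelse cases from the question. As a minimal base R sketch (not part of the original answer, and assuming the same example data held in plain data frames), outer can build the full df2-by-df1 matrix of overlaps:

```r
# Same example data as above, as plain data frames (no packages needed)
df1 <- data.frame(start = c(5, 20, 46, 80), end = c(15, 28, 68, 87))
df2 <- data.frame(start = c(20, 65), end = c(40, 85))

# Rows correspond to df2, columns to df1
latest_start <- outer(df2$start, df1$start, pmax)
earliest_end <- outer(df2$end, df1$end, pmin)

# Negative values mean the two intervals do not overlap, so clip at 0
overlap <- pmax(earliest_end - latest_start, 0)

# One total per df2 row
total <- rowSums(overlap)
print(total)  # 8 8
```

Like the cross join, this materializes nrow(df2) x nrow(df1) values, so it suits small to moderate inputs.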
Update: You can also avoid the full join by keying df2 on start and end and using data.table::foverlaps:
library(data.table)
df1 = data.table(start=c(5,20,46,80),end=c(15,28,68,87))
df2 = data.table(start=c(20,65), end=c(40,85))
setkey(df2,start,end)
df2[, rn := .I]
df = foverlaps(df1, df2, nomatch = NULL)
# overlap per matched pair, then total per df2 row
df[, overlap := pmin(i.end, end) - pmax(i.start, start)][, .(total = sum(overlap, na.rm = TRUE)), by = rn]
Output:
rn total
1: 1 8
2: 2 8
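Both variants return only the df2 rows that overlap something, while the question asks for one value per df2 row. The following is a hedged sketch of one way to keep non-overlapping rows with a total of 0. It assumes a hypothetical third df2 row (90 to 95) that is not in the original data, added only to exercise that case:

```r
library(data.table)

df1 <- data.table(start = c(5, 20, 46, 80), end = c(15, 28, 68, 87))
# Assumption: the third row (90-95) is hypothetical and overlaps nothing in df1
df2 <- data.table(start = c(20, 65, 90), end = c(40, 85, 95))
df2[, rn := .I]
setkey(df2, start, end)

# Only overlapping pairs are returned (nomatch = NULL drops the rest)
hits <- foverlaps(df1, df2, nomatch = NULL)
hits[, overlap := pmin(i.end, end) - pmax(i.start, start)]
totals <- hits[, .(total = sum(overlap)), by = rn]

# Join back onto every df2 row number; rows without a match get NA, then 0
out <- totals[df2[, .(rn)], on = "rn"]
out[is.na(total), total := 0]
print(out$total)  # 8 8 0
```

The join back onto df2's row numbers is what restores the missing rows; the overlap computation itself is unchanged.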