Compare one row of a column against all others in group

Question

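I am trying to calculate the number of days that each member of a group overlaps with all the other objects in that group. To do that, I want to compare each row of a column against every other row of that column within the same group. However, I have not been able to come up with a simple solution; most of my effort has gone into purrr's map variants. Beyond that, I have gone down some nested-loop (:-/) and nested-apply rabbit holes, but I suspect there is a very simple way to do this comparison.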

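Essentially, for one row in a group, I want the sum of the intersections between that row's interval and every interval in the group.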

Input data (format with intervals):

ID Group year  interval_obs  
1   A   2020 2020-04-29 UTC--2020-05-19 UTC  
2   A   2020 2020-05-04 UTC--2020-05-29 UTC  
3   A   2020 2020-05-09 UTC--2020-05-24 UTC  
4   A   2020 2020-04-24 UTC--2020-04-28 UTC  
5   A   2020 2020-05-30 UTC--2020-06-03 UTC  
6   B   2020 2019-12-31 UTC--2020-01-20 UTC  
7   B   2020 2020-01-10 UTC--2020-01-30 UTC  
8   B   2020 2020-01-20 UTC--2020-02-09 UTC  
9   B   2020 2020-01-15 UTC--2020-02-04 UTC

Input data (more readable?), where each start/end is a day of year (doy):

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155
6   B   2020    0   20
7   B   2020   10   30
8   B   2020   20   40
9   B   2020   15   35

Desired result:

ID  total_overlap  
  1   25  
  2   30  
  3   25  
  4    0  
  5    0  
  6   15  
  7   35  
  8   25  
  9   35

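Note that the desired total overlap is in days: for a row in group A, it is the sum of all days it overlaps with the other 4 observations in the group. Group B has 4 records to show that group sizes vary.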

Sample data for the question:

data <- structure(list(
  ID = 1:9,
  group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), 
  year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L,  2020L, 2020L, 2020L), 
  start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
  end = c(140L, 150L, 145L, 119L, 155L, 20L,  30L, 40L, 35L)),
  class = "data.frame", 
  row.names = c(NA, -9L))

library(dplyr)
library(lubridate)

data <- data %>% 
  group_by(group, year) %>% # real dataset has several combos - both vars left as reminder
  mutate(across(c(start, end), ~ as_date(., origin = paste0(year-1, "-12-31")))) %>%  #this year-1 term is due to leap years etc.
  mutate(interval_obs = interval(ymd(start), ymd(end))) %>% 
  dplyr::select(-start, -end)

output <- data %>% map(.x = .$interval_obs, # this code at least runs
              .f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs)))})

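The small chunk above is one of the many ways I have approached this (map2, map_df, etc.); although it does not work, I imagine (...) a solution is somewhere in that ballpark. Note that my example output has two features: 1) the units are converted to days, and 2) the 'self intersection' is subtracted. Don't worry about those features. I have ways to do both, and I left them out only because they might confuse the question. But in case it helps...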

mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>% 
mutate(results = results - self_intersection) %>% 
mutate(total_overlap = as.numeric(results)/86400)

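I have been trying to keep the data in lubridate or another date format so that it can easily be adapted to other time resolutions (e.g. hours, minutes) in the future.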
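A possible way to keep the resolution flexible, offered as a sketch rather than a tested part of the pipeline above: lubridate's time_length() reports the length of an interval in a chosen unit, so the hard-coded division by 86400 could become a unit argument. The two intervals are IDs 1 and 2 from the sample data; the variable names are made up for illustration.

```r
library(lubridate)

# IDs 1 and 2 from the sample data (hypothetical variable names)
obs_1 <- interval(ymd("2020-04-29"), ymd("2020-05-19"))
obs_2 <- interval(ymd("2020-05-04"), ymd("2020-05-29"))

# Their intersection runs from 2020-05-04 to 2020-05-19
shared <- lubridate::intersect(obs_1, obs_2)

# The same overlap expressed at three resolutions
overlap_days    <- time_length(shared, unit = "day")
overlap_hours   <- time_length(shared, unit = "hour")
overlap_minutes <- time_length(shared, unit = "minute")
```

Switching resolution would then only mean changing the unit string, assuming the intervals themselves are stored at sufficient precision.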

Edit 2 - Example of calculating the overlap for group A

(Data reproduced here)

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155

For ID 1 in group A (the numbers after 'comparison' refer to IDs):

comparison 1 - 2. End1 - Start2 = 15 days  
comparison 1 - 3. End1 - Start3 = 10 days  
comparison 1 - 4. NO OVERLAP    =  0 days  
comparison 1 - 5. NO OVERLAP    =  0 days  
total_overlap                     25 days
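
The same arithmetic can be sketched for every row of group A at once in base R, working on the day-of-year numbers instead of lubridate intervals. This is only an illustration of the calculation, not the approach from the question: the overlap of two ranges is taken as min(end) - max(start), clipped at zero, and the matrix name `overlap` is made up here.

```r
# Day-of-year bounds for group A, copied from the table above
start <- c(120, 125, 130, 115, 151)
end   <- c(140, 150, 145, 119, 155)

# overlap[i, j] = days shared by rows i and j (0 when they do not overlap)
overlap <- pmax(outer(end, end, pmin) - outer(start, start, pmax), 0)

# Drop each row's overlap with itself (the 'self intersection')
diag(overlap) <- 0

# Sum across the other rows of the group
total_overlap <- rowSums(overlap)
total_overlap
```

For group A this reproduces the desired totals of 25, 30, 25, 0 and 0 days.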

Answer 1

Is this what you are looking for?

The total overlap in the third row does not match your desired output, but perhaps that is a typo?

library(tidyverse)
library(lubridate)

data |> 
  group_by(group) |> 
  mutate(total_overlap = map_dbl(interval_obs, 
                                 \(x) x |> 
                                   intersect(interval_obs) |> 
                                   int_length() |> 
                                   sum(na.rm = TRUE) - int_length(x)  # subtract the self-intersection
                                 ) / 86400
         )
#> # A tibble: 9 × 5
#> # Groups:   group [2]
#>      ID group  year interval_obs                   total_overlap
#>   <int> <chr> <int> <Interval>                             <dbl>
#> 1     1 A      2020 2020-04-29 UTC--2020-05-19 UTC            25
#> 2     2 A      2020 2020-05-04 UTC--2020-05-29 UTC            30
#> 3     3 A      2020 2020-05-09 UTC--2020-05-24 UTC            25
#> 4     4 A      2020 2020-04-24 UTC--2020-04-28 UTC             0
#> 5     5 A      2020 2020-05-30 UTC--2020-06-03 UTC             0
#> 6     6 B      2020 2019-12-31 UTC--2020-01-20 UTC            15
#> 7     7 B      2020 2020-01-10 UTC--2020-01-30 UTC            35
#> 8     8 B      2020 2020-01-20 UTC--2020-02-09 UTC            25
#> 9     9 B      2020 2020-01-15 UTC--2020-02-04 UTC            35
