Compare overlap of groups pairwise using tidyverse

I have a tidy data.frame in this format:

library(tidyverse)
df = data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes")
)
df

#      name   fruit
#1 Clarence   Apple
#2 Clarence  Banana
#3 Clarence  Grapes
#4   Shelby   Apple
#5   Shelby Apricot
#6 Patricia  Banana
#7 Patricia  Grapes

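I want to compare the overlap between groups pairwise (e.g., if two people both have an apple, that is an overlap of 1), so that I end up with a data frame like this: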

df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap  = c(1, 2, 0))
df2

#              names n_overlap
#1   Clarence-Shelby       1
#2 Clarence-Patricia       2
#3   Shelby-Patricia       0

Is there an elegant way to do this within the tidyverse? My real dataset is much larger than this and will be grouped by multiple columns.

Try this:

combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby"   "Clarence-Patricia" "Patricia-Shelby"  

library(dplyr)
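df %>%
  group_by(fruit) %>%
  # one row per fruit, labelled with the sorted names that have it
  # (the label matches a pair only when exactly two names share the fruit)
  summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
  # keep every possible pair, including pairs with no shared fruit
  right_join(tibble(names = combinations), by = "names") %>%
  group_by(names) %>%
  summarize(n_overlap = sum(!is.na(fruit)))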
# # A tibble: 3 x 2
#   names             n_overlap
#   <chr>                 <int>
# 1 Clarence-Patricia         2
# 2 Clarence-Shelby           1
# 3 Patricia-Shelby           0
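
One caveat with the approach above: a fruit shared by three or more people is pasted into a single label such as "A-B-C", which matches none of the pair labels and is dropped by the join. A minimal sketch of a variant that expands each fruit into all of its name pairs first might look like this. The helper name `pairwise_overlap` is hypothetical (not part of the original answer), and the sketch assumes columns called `name` and `fruit` and at least one fruit shared by two or more people.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts shared fruits for every unordered pair of
# names, keeping pairs that share nothing with a count of 0.
pairwise_overlap <- function(data) {
  data <- distinct(data, name, fruit)
  data$name <- as.character(data$name)

  # every possible pair label, names sorted within the pair
  all_pairs <- tibble(
    names = combn(sort(unique(data$name)), 2, paste, collapse = "-")
  )

  # for each fruit held by 2+ people, list every pair of holders
  shared <- data %>%
    group_by(fruit) %>%
    filter(n() >= 2) %>%
    summarize(names = list(combn(sort(name), 2, paste, collapse = "-")),
              .groups = "drop") %>%
    unnest(names) %>%
    count(names, name = "n_overlap")

  # pairs missing from `shared` have no common fruit
  all_pairs %>%
    left_join(shared, by = "names") %>%
    mutate(n_overlap = coalesce(n_overlap, 0L))
}

pairwise_overlap(df)
```

On the example data this should give the same three counts as above. For several grouping columns, one option would be to add them to both `group_by()` and the pair table, but that part is left out of the sketch.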

If pairs with zero overlap don't matter, a solution is:

> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1

If you really need the non-overlapping pairs as well:

> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
    name.x   name.y n
1 Clarence Patricia 2
2 Clarence   Shelby 1
3 Patricia   Shelby 0
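
As a quick cross-check of the counts above, here is a base R sketch (not tidyverse, and not from the original answers) that builds a fruit-by-name incidence matrix and multiplies it with itself. It assumes each name/fruit combination should count once, so duplicates are removed first.

```r
df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby", "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# rows = fruits, columns = names; entries are 0/1 after de-duplication
u <- unique(df)
incidence <- unclass(table(u$fruit, u$name))

# t(incidence) %*% incidence: off-diagonal cells count fruits shared by two
# names, diagonal cells count the fruits each name has
overlap <- crossprod(incidence)
overlap
```

The off-diagonal entries should agree with the `n` column above, zeros included. For a large number of names this matrix grows quadratically, so it is mainly useful as a sanity check.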