Grouping IDs based on at least one common value

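I have a list whose elements are integer vectors, and I would like to merge those elements whenever they share at least one value. Elements that have no value in common with any other element should stay as they are. Here is my sample data: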

x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)

This is my desired output:

desired_result <- list(c(1, 2, 3, 4, 5, 8), 
                       c(6, 9, 7), 
                       10, 
                       11)

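I would prefer to use the reduce or accumulate functions from purrr, but any other tidyverse solution is welcome. So far I have tried the following, but it returns only a single union and apparently drops the rest: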

x %>% 
  reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)

[1] 1 2 3 4 5 8
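
To see why this attempt ends with a single vector, here is a minimal base R sketch that replays the same step with Reduce(accumulate = TRUE). The names step and steps are only illustrative. The accumulator always starts from the first element, so vectors that never overlap it (such as c(6, 9, 7), 10 and 11) are skipped rather than collected.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# Same rule as the purrr attempt: grow the accumulator only when it overlaps
# the next element, otherwise keep the accumulator unchanged
step <- function(acc, nxt) if (any(acc %in% nxt)) union(acc, nxt) else acc

# accumulate = TRUE keeps every intermediate state of the accumulator
steps <- Reduce(step, x, accumulate = TRUE)

# The last state is what reduce() returns; non-overlapping elements never enter it
print(steps[[length(steps)]])
```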

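In general, I am looking for a way to group integers (ids) that share common values, a kind of clustering, but unfortunately my efforts so far have been in vain.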

Thank you very much in advance for your help.

I suspect there is a set cover solution, but in the meantime, here is a graph approach:

First, let's convert the integer vectors into an edge list so that they can be turned into a graph. We can use expand.grid.

library(igraph)
edgelist <- do.call(rbind,lapply(x,\(x)expand.grid(x,x))) #R version >= 4.1.0

Now we have a two-column data.frame showing the connections between all of the integers (a set of edges).
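
As a small illustration of what that edge list contains, here is a sketch on a hypothetical two-element list (small is not the question's data). It shows that expand.grid also pairs each value with itself, which is why a singleton such as 10 or 11 still ends up as a vertex in the graph.

```r
# Hypothetical toy input, used only for illustration
small <- list(c(1, 2), 3)

# Pair every value of a vector with every value of the same vector;
# a singleton yields exactly one row, the self-loop 3-3
small_edges <- do.call(rbind, lapply(small, function(v) expand.grid(v, v)))

print(small_edges)
```

The c(1, 2) element contributes four rows (1-1, 2-1, 1-2, 2-2) and the singleton contributes one, so the result has five rows.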

igraph::graph.data.frame can conveniently build a graph from this (newer igraph releases name it graph_from_data_frame).

From there we can use igraph::components to extract the connected components.

g <- graph.data.frame(edgelist)
split(names(components(g)$membership),components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"

Or using the tidyverse:

library(dplyr); library(purrr)
map_dfr(x, ~expand.grid(.x,.x)) %>%
  graph.data.frame() %>%
  components() %>% 
  pluck(membership) %>%
  stack() %>%
  {split(as.numeric(as.character(.[,2])),.[,1])}

$`1`
[1] 1 2 3 4 5 8

$`2`
[1] 6 9 7

$`3`
[1] 10

$`4`
[1] 11

One approach:

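Set i is adjacent to set j if and only if intersect(i, j) is not the empty set. We want the connected components of the graph whose adjacency matrix has a 1 at position (i, j) if and only if set i is adjacent to set j, and 0 otherwise. The first 4 lines build the adjacency matrix and turn it into an undirected graph, lines 5 and 6 find the connected components, and the rest splits the list according to that membership and takes the unique values.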

library(tidyverse)
library(igraph)

map(x, function(a) map_int(x, ~length(base::intersect(a, .x)) > 0) * 1L) %>% 
  reduce(rbind) %>%
  graph.adjacency() %>%
  as.undirected() %>%
  components() %>%
  pluck("membership") %>%
  split(seq_along(.), .) %>%
  map(~unique(unlist(x[.x])))
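
To make the adjacency matrix described above concrete, here is a minimal base R sketch that builds the same 0/1 matrix without purrr or igraph. The name adj is only illustrative. Note that elements 1 and 4 are not adjacent, yet they land in the same group because they are linked through elements 2 and 3; that transitivity is what the connected components step provides.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# adj[i, j] is 1 when elements i and j of x share at least one value, else 0
adj <- sapply(x, function(a) {
  sapply(x, function(b) length(intersect(a, b)) > 0)
}) * 1L

# Sharing a value is symmetric, and every element overlaps itself
print(isSymmetric(adj))
print(adj[1, 4])  # c(1, 2) vs c(3, 4, 5): no direct overlap
print(adj[6, 7])  # c(6, 9, 7) vs 7: direct overlap
```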

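Thanks to a very informative post that my dear friend @Ian Campbell pointed me to, I wanted to challenge myself to write a custom function for this purpose. It is still a first version: it is not very elegant and can certainly be improved a lot, but for now it seems stable, since I have tried it on several inputs and it has not let me down.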

anoush <- function(x) {
# First we check whether x is a list

  stopifnot(is.list(x)) 

# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices 
# in `vec` vector. So in the end we have a list called `ind` whose elements 
# are all the indices connected with the corresponding elements of the original 
# list for example first element of `ind` is `1`, `2`, `3` which means in 
# the original list these elements have common values.
  
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for(i in 1:length(x)) {
      if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec 
    })

# Then we compare each element of `ind` with the other elements again.
# Wherever there is an intersection, we calculate the `union` of the two.
# For each element we end up with a list of accumulated values, and
# we use `Reduce` to collapse them into a single vector. So for each
# element of `ind` we end up with a collection of indices, which also 
# results in duplicated values. For example, elements `1` through `5` of 
# `dup_ind` contain the same values because in the original list these 
# elements have common values.

  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for(i in 1:length(ind)) {
      if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  }) 

# Here we get rid of the duplicated elements of the list by means of the
# `relist` function, and since in this process all the duplicated elements
# turn into `integer(0)`, I filter those out.
  
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  
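  # `lapply` always returns a list; `sapply` would simplify to a matrix
  # whenever all groups happen to have the same length
  lapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))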
  
}

Output

> anoush(x)

[[1]]
[1] 1 2 3 4 5 8

[[2]]
[1] 6 9 7

[[3]]
[1] 10

[[4]]
[1] 11
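
As an independent cross-check for results like the one above, here is a minimal base R sketch of a different technique: a single pass that merges each incoming vector with every group it overlaps. The function name merge_overlapping is hypothetical and not part of any package. The groups stay pairwise disjoint after every step, which is why one pass is enough, but the order of the groups can differ from the igraph output, so the comparison sorts them by their smallest value first.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# Hypothetical helper: merge vectors that share at least one value
merge_overlapping <- function(sets) {
  stopifnot(is.list(sets))
  groups <- list()
  for (s in sets) {
    # Which of the groups built so far share a value with s?
    hit <- vapply(groups, function(g) any(s %in% g), logical(1))
    # Fuse s with all groups it touches; untouched groups are kept as they are
    merged <- unique(c(unlist(groups[hit]), s))
    groups <- c(groups[!hit], list(merged))
  }
  groups
}

res <- merge_overlapping(x)

# Sort the groups by their smallest value so the result is comparable
res <- res[order(vapply(res, min, numeric(1)))]
print(res)
```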