Number of coincident combinations between groups to generate a posterior matrix

I have a data frame like df:

id <- c("A" , "A" , "A" , "A", "B", "B", "B", "C", "C", "C") 
type <- c(1, 4, 3, 6, 1, 4, 5, 2, 3, 6)
df <- data.frame(id, type)  # base R; data_frame() is deprecated

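I want to count the combinations of type that occur together within each group (id).

Afterwards I want to use those counts to build a symmetric matrix (A):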

A = matrix(
  # Taking sequence of elements 
  c(NA, 0, 1, 2, 1, 1, 0, NA, 1, 0,0,1, 1, 1, NA, 1,0,2, 2, 0, 1, NA, 1, 1, 1,0,0,1, NA, 0, 1,1,2,1,0, NA),
  # No of rows
  nrow = 6,  
  # No of columns
  ncol = 6,        
  # By default matrices are in column-wise order
  # So this parameter decides how to arrange the matrix
  byrow = TRUE         
)
# Naming rows
rownames(A) = c("Type 1", "Type 2", "Type 3", "Type 4", "Type 5", "Type 6")

# Naming columns
colnames(A) = c("Type 1", "Type 2", "Type 3", "Type 4", "Type 5", "Type 6")

cat("Number of coincidences between Type by id")
print(A)
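
To make the meaning of a single cell concrete, here is a minimal sketch, assuming each cell counts the ids that contain both types. The helper ids_with is a hypothetical name introduced only for this illustration.

```r
# Toy data from above.
id   <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c(1, 4, 3, 6, 1, 4, 5, 2, 3, 6)

# Hypothetical helper: the ids in which a given type occurs.
ids_with <- function(t) unique(id[type == t])

# Type 1 occurs in A and B, Type 4 occurs in A and B: two shared ids.
n_1_4 <- length(intersect(ids_with(1), ids_with(4)))

# Type 2 occurs only in C, Type 5 only in B: no shared id.
n_2_5 <- length(intersect(ids_with(2), ids_with(5)))

print(c(n_1_4, n_2_5))
```

These two values correspond to the cells A["Type 1", "Type 4"] and A["Type 2", "Type 5"] in the matrix above.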

My attempt went like this...

library(dplyr)
library(purrr)
library(tidyr)

intermediate_step <- expand.grid(Variety1 = unique(df$Type),    # reshape with a symmetric output
                                 Variety2 = unique(df$Type), stringsAsFactors = FALSE) %>%
  mutate(counts = map2_dbl(Variety1, Variety2,
                           ~ length(intersect(df$id[df$Type == .x],
                                              df$id[df$Type == .y])))) %>%
  filter(Variety1 != Variety2)

AA <- spread(intermediate_step, Variety2, counts)

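...but two big problems came up:

  1. intermediate_step is not computed correctly.
  2. This approach is computationally very expensive. It runs on this toy example, but on my real data (93k entries) RStudio aborts the session.

... a possible solution to the second problem ...

Any clues on how to perform the analysis in a more computationally efficient way, or on how to apply the solution I proposed?

Thanks :)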

Your data seems to contain an error.

With the correct data, i.e. with row 2, column 2 set to 4 rather than 2 (df[2, 2] <- 4), you can do:

`diag<-`(crossprod(table(df)), NA)

    type
type  1  2  3  4  5  6
   1 NA  0  1  2  1  1
   2  0 NA  1  0  0  1
   3  1  1 NA  1  0  2
   4  2  0  1 NA  1  1
   5  1  0  0  1 NA  0
   6  1  1  2  1  0 NA
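
The one-liner above can be unpacked into three steps, which may make it easier to adapt. The sketch below is an illustration, not a tuned implementation. It assumes the toy data from the question. The 0/1 guard and the sparse variant are additions that are not part of the original answer. The sparse variant assumes the Matrix package is available and that the number of distinct types is small enough for a dense types-by-types result.

```r
# Toy data from the question (assumed to be the corrected version).
id   <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c(1, 4, 3, 6, 1, 4, 5, 2, 3, 6)
df   <- data.frame(id, type)

# Step 1: id-by-type incidence table (rows = id, columns = type).
inc <- table(df)

# Optional guard (an assumption about the real data): if an id can list the
# same type more than once, reduce counts to 0/1 so each id counts once.
inc <- (inc > 0) * 1L

# Step 2: t(inc) %*% inc. Cell (i, j) is the number of ids that contain
# both type i and type j.
co <- crossprod(inc)

# Step 3: the diagonal holds the number of ids per type; blank it out.
diag(co) <- NA
print(co)

# Sparse variant for larger inputs: build the incidence matrix as a sparse
# matrix so the id-by-type table is never stored densely.
library(Matrix)
inc_sp <- xtabs(~ id + type, data = df, sparse = TRUE)
co_sp  <- as.matrix(crossprod(inc_sp))  # dense types-by-types result
diag(co_sp) <- NA
```

On the toy data both co and co_sp should hold the same counts as the matrix A in the question. Note that the sparse variant does not apply the 0/1 guard, so it assumes each (id, type) pair appears at most once.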