R: Count the frequency of pairwise matching strings between all rows of a matrix

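I have a 5000 x 1000 character matrix in R in which every entry is a color (red, blue, yellow, green, etc.). I want to count, pairwise between every two rows of the matrix, how often the colors (strings) match across all columns. Each of the 1000 columns holds a different iteration of the color labels, and there is no limit on the number of distinct labels per column. For example, the first column might have 8 distinct color labels, the second 10, the third 11, and so on. I am not interested in the labels themselves, only in the frequency with which a pair of rows matches or does not match across the columns.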

For example, my character matrix looks something like this (without the artificially regular repeating color pattern):

# Draw 8 colors at random, then recycle them to fill a 10 x 5 toy matrix
colors <- sample(c("grey", "green", "blue", "pink", "brown", "purple", "cyan", "red", "yellow"), 8, replace = TRUE)
labels <- matrix(rep_len(colors, 10 * 5), nrow = 10, ncol = 5)
labels
     [,1]     [,2]     [,3]     [,4]     [,5]    
 [1,] "brown"  "purple" "yellow" "green"  "brown" 
 [2,] "grey"   "red"    "brown"  "red"    "grey"  
 [3,] "purple" "yellow" "green"  "brown"  "purple"
 [4,] "red"    "brown"  "red"    "grey"   "red"   
 [5,] "yellow" "green"  "brown"  "purple" "yellow"
 [6,] "brown"  "red"    "grey"   "red"    "brown" 
 [7,] "green"  "brown"  "purple" "yellow" "green" 
 [8,] "red"    "grey"   "red"    "brown"  "red"   
 [9,] "brown"  "purple" "yellow" "green"  "brown" 
[10,] "grey"   "red"    "brown"  "red"    "grey"  

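I want to use this to build a 5000 x 5000 square, symmetric matrix of pairwise match frequencies between rows. Each entry [i, j] (and [j, i]) should be the frequency of matches between row i and row j across all columns. For example, in the toy labels matrix above, row 1 matches row 6 in columns 1 and 5 but in no other column, so I would want the match frequency (2/5 = 0.4) as entries [1, 6] and [6, 1] of the "frequency matrix". The diagonal would be all 1s, since every row always matches itself. The output would look something like this: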

freq.mat
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]    
 [1,]  1     0     0     0     0    0.4    0     0     1      0
 [2,]  0     1     0     0    0.2   0.4    0     0     0      1     
 [3,]  0     0     1     0     0     0     0    0.2    0      0
 [4,]  0     0     0     1     0     0    0.2   0.6    0      0
 [5,]  0    0.2    0     0     1     0     0     0     0     0.2
 [6,] 0.4   0.4    0     0     0     1     0     0    0.4    0.4 
 [7,]  0     0     0    0.2    0     0     1     0     0      0 
 [8,]  0     0    0.2   0.6    0     0     0     1     0      0   
 [9,]  1     0     0     0     0    0.4    0     0     1      0 
[10,]  0     1     0     0    0.2   0.4    0     0     0      1 
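
As a quick check of the definition, individual entries of the table above can be reproduced by comparing two rows directly. This is only a sketch: the colors vector below is copied from the printed example rather than drawn with sample(), and match_freq is a helper name made up for illustration.

```r
# Toy matrix rebuilt from the printed example (values copied from the output above)
colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep_len(colors, 10 * 5), nrow = 10, ncol = 5)

# Proportion of columns in which rows i and j carry the same label
# (match_freq is a hypothetical helper, not part of any package)
match_freq <- function(i, j) mean(labels[i, ] == labels[j, ])

match_freq(1, 6)  # 0.4: rows 1 and 6 agree in columns 1 and 5
match_freq(4, 8)  # 0.6: rows 4 and 8 agree in columns 1, 3 and 5
match_freq(1, 9)  # 1: rows 1 and 9 are identical
```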

I tried applying the rowSums function as follows:

freq.mat <- apply(labels, 1, function(x) rowSums(x == labels))
diag(freq.mat) <- 1
freq.mat / 10

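This produces a matrix of the right size, but the entries are not the pairwise row-match frequencies I am after. I have also tinkered with some nested for loops without much progress, and that approach feels very un-R-like anyway.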

Can anyone point me in the right direction? Many thanks!

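You are comparing the wrong values. In x == labels, the row vector x is recycled down the columns of labels, so its elements line up with the wrong entries. Transpose labels first, so that each column of t(labels) is one row of the original matrix, and take column means: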

apply(labels, 1, function(x) colMeans(x == t(labels)))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
 [2,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
 [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.2  0.0   0.0
 [4,]  0.0  0.0  0.0  1.0  0.0  0.0  0.2  0.6  0.0   0.0
 [5,]  0.0  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.0   0.2
 [6,]  0.4  0.4  0.0  0.0  0.0  1.0  0.0  0.0  0.4   0.4
 [7,]  0.0  0.0  0.0  0.2  0.0  0.0  1.0  0.0  0.0   0.0
 [8,]  0.0  0.0  0.2  0.6  0.0  0.0  0.0  1.0  0.0   0.0
 [9,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
[10,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
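
For the full 5000 x 1000 matrix, the apply() call above compares each row against the whole transposed matrix, 5000 times over. One possible alternative, offered here as an untested-at-scale sketch and not part of the answer above, is to turn every column into a block of 0/1 indicator columns and obtain all pairwise match counts from a single cross-product. The helper name indicator and the variable Z are made up for illustration, and the toy matrix is rebuilt from the printed example in the question.

```r
# Toy matrix rebuilt from the printed example in the question
colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep_len(colors, 10 * 5), nrow = 10, ncol = 5)

# One 0/1 indicator block per column: rows x (distinct labels in that column).
# `indicator` is a hypothetical helper name, not from the original answer.
indicator <- function(col) {
  lev <- unique(col)
  outer(col, lev, "==") * 1  # logical -> numeric 0/1
}

# Bind the per-column blocks side by side
Z <- do.call(cbind, lapply(seq_len(ncol(labels)), function(k) indicator(labels[, k])))

# Z %*% t(Z) counts, for each pair of rows, the columns in which the labels agree
freq.mat <- tcrossprod(Z) / ncol(labels)

# Compare against the apply() version
ref <- apply(labels, 1, function(x) colMeans(x == t(labels)))
all.equal(freq.mat, ref)
```

With about 10 labels per column, Z would be a dense matrix of roughly 5000 x 10000 entries for the real data, so a sparse representation may be preferable there. Whether this is actually faster than apply() has not been benchmarked here.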