R: Count the frequency of pairwise matching strings between all rows of a matrix
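I have a 5000 x 1000 character matrix in R, where each entry is a color (red, blue, yellow, green, etc.). I want to compute, for every pair of rows, the frequency with which their colors (strings) match across all columns. Each of the 1000 columns holds a different iteration of the color labels, and there is no limit on the number of distinct labels per column. For example, the first column might have 8 distinct color labels, the second 10, the third 11, and so on. I'm not interested in the labels themselves, only in the frequency with which a pair of rows matches or fails to match in each column.
For example, my character matrix looks something like this (minus the artificial, regularly repeating color pattern):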
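# Draw 8 colors at random (no seed is set, so the colors differ between runs)
colors <- sample(c("grey", "green", "blue", "pink", "brown", "purple", "cyan", "red", "yellow"), 8, replace = TRUE)
# Recycle the 8 colors to fill a 10 x 5 matrix column by column
labels <- matrix(rep_len(colors, 10 * 5), nrow = 10, ncol = 5)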
labels
[,1] [,2] [,3] [,4] [,5]
[1,] "brown" "purple" "yellow" "green" "brown"
[2,] "grey" "red" "brown" "red" "grey"
[3,] "purple" "yellow" "green" "brown" "purple"
[4,] "red" "brown" "red" "grey" "red"
[5,] "yellow" "green" "brown" "purple" "yellow"
[6,] "brown" "red" "grey" "red" "brown"
[7,] "green" "brown" "purple" "yellow" "green"
[8,] "red" "grey" "red" "brown" "red"
[9,] "brown" "purple" "yellow" "green" "brown"
[10,] "grey" "red" "brown" "red" "grey"
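I want to use it to build a 5000 x 5000 square symmetric matrix holding the pairwise match frequency between rows. Each entry [i, j] (and [j, i]) should be the frequency of matches between row i and row j across all columns. For example, in the toy labels matrix above, row 1 matches row 6 in columns 1 and 5 but in no other column, so I want the match frequency (2/5 = 0.4) in entries [1, 6] and [6, 1] of the "frequency matrix". The diagonal will be all 1s, since each row always matches itself. The output would look something like this: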
freq.mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 0 0 0 0 0.4 0 0 1 0
[2,] 0 1 0 0 0.2 0.4 0 0 0 1
[3,] 0 0 1 0 0 0 0 0.2 0 0
[4,] 0 0 0 1 0 0 0.2 0.6 0 0
[5,] 0 0.2 0 0 1 0 0 0 0 0.2
[6,] 0.4 0.4 0 0 0 1 0 0 0.4 0.4
[7,] 0 0 0 0.2 0 0 1 0 0 0
[8,] 0 0 0.2 0.6 0 0 0 1 0 0
[9,] 1 0 0 0 0 0.4 0 0 1 0
[10,] 0 1 0 0 0.2 0.4 0 0 0 1
I tried applying the rowSums function as follows:
freq.mat <- apply(labels, 1, function(x) rowSums(x == labels))
diag(freq.mat) <- 1
freq.mat / 10
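This produced a matrix of the right size, but the entries are not the pairwise row-match frequencies I was hoping for. I also tinkered with some nested for loops but didn't get far, and that felt very un-R-like anyway.
Can anyone point me in the right direction? Many thanks!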
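You are comparing the wrong values: x == labels recycles x down the columns of labels, so the elements of a row are not lined up with the columns they came from. Transpose labels first, then take column means: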
apply(labels, 1, function(x) colMeans(x == t(labels)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 1.0 0.0
[2,] 0.0 1.0 0.0 0.0 0.2 0.4 0.0 0.0 0.0 1.0
[3,] 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0
[4,] 0.0 0.0 0.0 1.0 0.0 0.0 0.2 0.6 0.0 0.0
[5,] 0.0 0.2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.2
[6,] 0.4 0.4 0.0 0.0 0.0 1.0 0.0 0.0 0.4 0.4
[7,] 0.0 0.0 0.0 0.2 0.0 0.0 1.0 0.0 0.0 0.0
[8,] 0.0 0.0 0.2 0.6 0.0 0.0 0.0 1.0 0.0 0.0
[9,] 1.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 1.0 0.0
[10,] 0.0 1.0 0.0 0.0 0.2 0.4 0.0 0.0 0.0 1.0
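As a cross-check, here is a minimal sketch of the same calculation written column by column, which may make the logic easier to follow. The helper name pairwise_match_freq and the 3 x 3 toy matrix are made up for illustration and do not come from the question; treat this as one possible formulation rather than the definitive one.

```r
# Hypothetical helper (not from the original post): fraction of columns in
# which each pair of rows carries the same label.
pairwise_match_freq <- function(m) {
  freq <- matrix(0, nrow(m), nrow(m))
  for (j in seq_len(ncol(m))) {
    # outer() compares the label of every row with every other row in column j
    freq <- freq + outer(m[, j], m[, j], "==")
  }
  # Turn match counts into frequencies
  freq / ncol(m)
}

# Illustrative data: rows 1 and 3 are identical, rows 1 and 2 agree only in column 1
toy <- matrix(c("red", "blue",  "green",
                "red", "green", "blue",
                "red", "blue",  "green"),
              nrow = 3, byrow = TRUE)

freq <- pairwise_match_freq(toy)

# The one-liner from above should give the same matrix
same <- apply(toy, 1, function(x) colMeans(x == t(toy)))
stopifnot(isTRUE(all.equal(freq, same)))
```

The loop touches one column at a time, so it only ever holds one n x n logical comparison in memory alongside the running total. Whether that is faster or slower than the apply version on the full 5000 x 1000 matrix is something to measure rather than assume.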