Computing correlation of matrix

Question

I have a matrix of documents and all the words they contain. Each number is how many times that word occurs in the document.

| topic    | word1 | word2 | word3 | word4 | word5 |...
|----------|-------|-------|-------|-------|-------|
| politics | 5     | 2     | 4     | 0     | 1     |
| sports   | 2     | 0     | 1     | 1     | 6     |
| music    | 2     | 3     | 1     | 3     | 6     |
| movies   | 0     | 3     | 2     | 6     | 1     |
| history  | 4     | 6     | 2     | 3     | 3     |
|...

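I want to compute and visualize the correlation between the documents. For example, I'd like to see whether documents about music are more similar to documents about movies or to documents about politics, and so on.

When I run: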

csv <- read.csv("documents.csv")
matrix <- data.matrix(csv)
cor(matrix)

I get:

            topic       word1       word2       word3      word4      word5
topic  1.00000000  0.08111071 -0.94812244  0.00000000 -0.6868028  0.3779645
word1  0.08111071  1.00000000  0.21296184  0.62828086 -0.7687575 -0.1635038
word2 -0.94812244  0.21296184  1.00000000  0.09415545  0.4307761 -0.3032248
word3  0.00000000  0.62828086  0.09415545  1.00000000 -0.3546635 -0.8132501
word4 -0.68680282 -0.76875749  0.43077610 -0.35466345  1.0000000 -0.2249755
word5  0.37796447 -0.16350382 -0.30322482 -0.81325006 -0.2249755  1.0000000

I'm not sure whether this is the correct result, or how to interpret it.

Update:

> dput(csv)
structure(list(topic = structure(c(4L, 5L, 3L, 2L, 1L), .Label = c("history", 
"movies", "music", "politics", "sports"), class = "factor"), 
    word1 = c(5L, 2L, 2L, 0L, 4L), word2 = c(2L, 0L, 3L, 3L, 
    6L), word3 = c(4L, 1L, 1L, 2L, 2L), word4 = c(0L, 1L, 3L, 
    6L, 3L), word5 = c(1, 6, 6, 1, 3)), .Names = c("topic", "word1", 
"word2", "word3", "word4", "word5"), class = "data.frame", row.names = c(NA, 
-5L))


> dput(matrix)
structure(c(4, 5, 3, 2, 1, 5, 2, 2, 0, 4, 2, 0, 3, 3, 6, 4, 1, 
1, 2, 2, 0, 1, 3, 6, 3, 1, 6, 6, 1, 3), .Dim = 5:6, .Dimnames = list(
    NULL, c("topic", "word1", "word2", "word3", "word4", "word5"
    )))

Answer 1

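You probably want to drop the first column and work with the transposed matrix. data.matrix() turns the topic factor into integer codes, which is why topic shows up as a variable in your result, and cor() correlates columns, so you are currently comparing words rather than documents: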

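csv <- read.csv("documents.csv")
# use the topic column as row names, then drop it so only the counts remain
row.names(csv) <- csv[,1]
csv <- csv[,-1]
# cor() correlates columns, so transpose to put one document in each column
matrix <- as.matrix(csv)
cor(t(matrix))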
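
To try this without the CSV file, here is a minimal sketch that rebuilds the data from the dput() output in the question. The names docs, counts, doc_cor and music are made up for illustration, and the heatmap() call is just one possible way to visualize the result, not the only one.

```r
# Rebuild the question's data from its dput() output (no CSV file needed)
docs <- data.frame(
  topic = c("politics", "sports", "music", "movies", "history"),
  word1 = c(5, 2, 2, 0, 4),
  word2 = c(2, 0, 3, 3, 6),
  word3 = c(4, 1, 1, 2, 2),
  word4 = c(0, 1, 3, 6, 3),
  word5 = c(1, 6, 6, 1, 3)
)

# Keep only the counts, labelled by topic (one row per document)
counts <- as.matrix(docs[, -1])
rownames(counts) <- docs$topic

# cor() correlates columns, so transpose to get a topic-by-topic matrix
doc_cor <- cor(t(counts))
print(round(doc_cor, 2))

# Topics ranked by their correlation with "music" (music itself dropped)
music <- doc_cor["music", colnames(doc_cor) != "music"]
print(sort(music, decreasing = TRUE))

# One way to visualize it: a symmetric heatmap (drawn only in an interactive session)
if (interactive()) heatmap(doc_cor, symm = TRUE)
```

Each cell of doc_cor is the Pearson correlation between the word counts of two topics, so a value near 1 suggests the two topics use these words in similar proportions. With only five words per document the numbers are rough, so treat them as an illustration rather than a firm similarity measure.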

Tags: csv, r, similarity, matrix, correlation