Word co-occurrence matrix from gensim

Question

When building a Python gensim word2vec model, is there a way to see a doc-to-word matrix?

Given the input sentences = [['first', 'sentence'], ['second', 'sentence']], I would expect to see something like this*:

      first  second  sentence
doc0    1       0        1
doc1    0       1        1

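*I have shown this in 'human readable' form, but I am looking for a scipy (or other) matrix, indexed to model.wv.index2word.

Also, can that be transformed into a word-to-word matrix (to see co-occurrences)? Something like: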

          first  second  sentence
first       1       0        1
second      0       1        1  
sentence    1       1        2

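I have already implemented something like a word-word co-occurrence matrix using CountVectorizer, and it works well. However, I already use gensim in my pipeline, and speed and code simplicity matter for my use case.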

Answer 1

Given a corpus that is a list of lists of words, what you want to do is create a Gensim Dictionary, convert your corpus to bag-of-words, and then create your matrix:

from gensim.matutils import corpus2csc
from gensim.corpora import Dictionary

# somehow create your corpus

dct = Dictionary(corpus)
bow_corpus = [dct.doc2bow(line) for line in corpus]
term_doc_mat = corpus2csc(bow_corpus)

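Your term_doc_mat is a SciPy compressed sparse column (CSC) matrix, with one row per term and one column per document. If you want a term-term matrix, you can always multiply it by its transpose, i.e.: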

# term_doc_mat is a scipy.sparse matrix, so use its own dot() rather than np.dot()
term_term_mat = term_doc_mat.dot(term_doc_mat.T)

Answer 2

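Going from doc-word to word-word turned out to be more complicated than I originally expected (for me at least). np.dot() is the key to the solution, but I needed to apply a mask first. I created a more complex example for testing...

Imagine a doc-word matrix: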

#       word1  word2  word3
# doc0    3      4      2
# doc1    6      1      0
# doc2    8      0      4

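In the docs where word2 appears (doc0 and doc1), word1 appears 9 times (3 + 6).
In the docs where word2 appears, word2 appears 5 times (4 + 1).
In the docs where word2 appears, word3 appears 2 times (2 + 0).

So, when we are done, we should get something like the following (or its transpose). Reading by columns, the word-word matrix becomes: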

#       word1  word2  word3
# word1  17      9     11
# word2   5      5      4
# word3   6      2      6
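To make that definition concrete, here is a brute-force sketch in plain Python that builds the expected matrix directly from the rule above (column j only collects counts from docs in which word j appears). The name expected is just an illustrative choice.

```python
# Rows are documents, columns are word1..word3 (same counts as above).
doc2word = [
    [3, 4, 2],
    [6, 1, 0],
    [8, 0, 4],
]
num_words = len(doc2word[0])

# expected[i][j]: occurrences of word i, summed over docs that contain word j
expected = [[0] * num_words for _ in range(num_words)]
for doc in doc2word:
    for j in range(num_words):
        if doc[j] == 0:
            continue  # word j is absent, so this doc does not count for column j
        for i in range(num_words):
            expected[i][j] += doc[i]

for row in expected:
    print(row)
```

This is far too slow for a real corpus, but it gives a reference result to check any matrix-based shortcut against.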

A straight np.dot() product yields:

import numpy as np
doc2word = np.array([[3,4,2],[6,1,0],[8,0,4]])
np.dot(doc2word,doc2word.T)
# array([[29, 22, 32],
#        [22, 37, 48],
#        [32, 48, 80]])

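Read as a word-word matrix, this would imply that word1 occurs with itself 29 times, which does not match the expected 17. Multiplying in this order actually produces a doc-by-doc matrix, not a word-by-word one.

However, if instead of multiplying doc2word by itself I first build a mask, I get closer. I then also need to reverse the order of the arguments: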
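To see which axis each product is indexed by, it may help to compute both orders side by side. This is only a sketch using the same toy matrix; the names doc_doc and word_word are illustrative.

```python
import numpy as np

doc2word = np.array([[3, 4, 2], [6, 1, 0], [8, 0, 4]])

# (docs x words) @ (words x docs): indexed by document on both axes
doc_doc = doc2word @ doc2word.T

# (words x docs) @ (docs x words): indexed by word on both axes,
# but each entry is a sum of products of counts
word_word = doc2word.T @ doc2word

print(doc_doc)
print(word_word)
```

Even in the word-by-word order, the entry for word1 with itself is 109 (9 + 36 + 64) rather than 17, because the raw counts are multiplied together instead of being summed over the docs where the other word is present.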

import numpy as np
doc2word = np.array([[3,4,2],[6,1,0],[8,0,4]])
# a mask where all values greater than 0 are true
# so when this is multiplied by the orig matrix, True = 1 and False = 0
doc2word_mask = doc2word > 0  

np.dot(doc2word.T, doc2word_mask)
# array([[17,  9, 11],
#        [ 5,  5,  4],
#        [ 6,  2,  6]])
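The same masking idea should carry over to a sparse term-document matrix such as the one corpus2csc returns. The sketch below assumes terms are rows and documents are columns, and builds a small stand-in matrix with scipy instead of calling gensim; counts and docs_shared are illustrative names.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Stand-in for corpus2csc output: terms as rows, documents as columns.
term_doc = csc_matrix(np.array([[3, 4, 2], [6, 1, 0], [8, 0, 4]]).T)

# Presence mask: 1 where a term occurs in a document, 0 elsewhere.
mask = (term_doc > 0).astype(np.int64)

# counts[i, j]: occurrences of term i summed over documents containing term j
counts = term_doc @ mask.T

# docs_shared[i, j]: number of documents that contain both term i and term j
docs_shared = mask @ mask.T

print(counts.toarray())
print(docs_shared.toarray())
```

Multiplying the mask by its own transpose gives a symmetric matrix of shared-document counts, which may be closer to what the question's word-to-word table shows when words can repeat within a document.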

I spent a long time thinking about this one...
