K-means clustering on n-dimensional vectors

Question

I applied TF-IDF to a set of text documents and got n-dimensional vectors of varying length, one vector per document.

    from gensim import corpora, models

    # `texts` (tokenized documents) and `frequency` (token -> count) are assumed to be defined earlier.
    texts = [[token for token in text if frequency[token] > 1] for text in texts]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda = models.ldamodel.LdaModel(corpus, num_topics=100, id2word=dictionary)
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
    corpus_lsi = lsi[corpus_tfidf]
    corpus_lda = lda[corpus]
    print("TFIDF:")
    print(corpus_tfidf[1])
    print("__________________________________________")
    print(corpus_tfidf[2])

The output is:

TFIDF:
Vec1:    [(19, 0.06602704727889631), (32, 0.360417819987515), (33, 0.3078487494326974), (34, 0.360417819987515), (35, 0.2458968255872351), (36, 0.23680107692707422), (37, 0.29225639811281434), (38, 0.31741275088103), (39, 0.28571949457481044), (40, 0.32872456368129543), (41, 0.3855741727557306)]
    __________________________________________
Vec2:    [(5, 0.05617283528623041), (6, 0.10499864499395724), (8, 0.11265354901199849), (16, 0.028248249837939252), (19, 0.03948130674177094), (29, 0.07013501129200184), (33, 0.18408018239985235), (42, 0.14904146984986072), (43, 0.20484144632880313), (44, 0.215514203535732), (45, 0.15836501876891904), (46, 0.08505477582234795), (47, 0.07138425858136686), (48, 0.127695955436003), (49, 0.18408018239985235), (50, 0.2305566099597365), (51, 0.20484144632880313), (52, 0.2305566099597365), (53, 0.2305566099597365), (54, 0.053099690797234665), (55, 0.2305566099597365), (56, 0.2305566099597365), (57, 0.2305566099597365), (58, 0.0881162347543671), (59, 0.20484144632880313), (60, 0.16408387627386525), (61, 0.08256873616398946), (62, 0.215514203535732), (63, 0.2305566099597365), (64, 0.16731192344738707), (65, 0.2305566099597365), (66, 0.2305566099597365), (67, 0.07320703902661252), (68, 0.17912628269786976), (69, 0.12332630621892736)]

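Any index that is not listed in a vector has the value 0. For example, if (18, ...) does not appear in a vector, its component at index 18 is 0.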
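To make the zero-filling rule concrete, here is a minimal sketch in plain Python that expands a sparse vector into a fixed-length dense list. The helper name `to_dense` and the value of `num_terms` are assumptions for illustration only; with real data, `num_terms` would be the vocabulary size, for example `len(dictionary)`.

```python
def to_dense(sparse_vec, num_terms):
    """Expand a list of (index, weight) pairs into a dense list of length num_terms."""
    dense = [0.0] * num_terms
    for index, weight in sparse_vec:
        dense[index] = weight
    return dense

# Shortened, made-up vectors in the same (index, weight) format as the output above.
vec1 = [(1, 0.5), (3, 0.25)]
vec2 = [(0, 0.75)]

dense1 = to_dense(vec1, num_terms=5)
dense2 = to_dense(vec2, num_terms=5)
print(dense1)  # [0.0, 0.5, 0.0, 0.25, 0.0]
print(dense2)  # [0.75, 0.0, 0.0, 0.0, 0.0]
```

Both results have the same length regardless of how many non-zero entries each document had, which is the property a matrix-based clustering routine needs.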

I want to apply K-means clustering to these vectors (Vec1 and Vec2).

Scikit-learn's K-means clustering requires vectors of equal dimension, in matrix format. What should I do about this?

Answer 1

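After looking at the source code, it appears that gensim manually builds a sparse vector for each document (which is just a list of tuples). That makes the error understandable: scikit-learn's KMeans algorithm accepts sparse scipy matrices, but it does not know how to interpret gensim's sparse vectors. You can convert each of these individual lists into a scipy csr_matrix with the following (converting all documents at once would be better, but this is a quick fix).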

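    from scipy.sparse import csr_matrix

    doc = corpus_tfidf[1]
    rows = [0] * len(doc)
    cols = [tup[0] for tup in doc]
    data = [tup[1] for tup in doc]
    # Pass an explicit shape so every document gets the same number of columns.
    sparse_vec = csr_matrix((data, (rows, cols)), shape=(1, len(dictionary)))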

You should be able to use this sparse_vec directly, but if it throws an error, you can turn it into a dense numpy array with .toarray() or a numpy matrix with .todense().
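Since converting all documents at once is preferable to building one matrix per document, here is a minimal sketch of that idea. It assumes a small hand-written corpus in gensim's (index, weight) format and a hypothetical vocabulary size `num_terms`; with real data you would iterate over `corpus_tfidf` and use `len(dictionary)`.

```python
from scipy.sparse import csr_matrix

# Made-up documents in gensim's sparse format; not real TF-IDF output.
docs = [
    [(0, 0.5), (2, 0.5)],
    [(1, 1.0)],
    [(0, 0.3), (3, 0.7)],
]
num_terms = 4  # assumed vocabulary size

# Collect the coordinates of every non-zero entry across all documents.
rows, cols, data = [], [], []
for row, doc in enumerate(docs):
    for col, weight in doc:
        rows.append(row)
        cols.append(col)
        data.append(weight)

# An explicit shape gives every document the same number of columns.
matrix = csr_matrix((data, (rows, cols)), shape=(len(docs), num_terms))

# Dense fallback, in case a downstream routine does not accept sparse input.
dense = matrix.toarray()
print(matrix.shape)  # (3, 4)
print(dense)
```

Each row of `matrix` is one document, so it can be passed to a clustering routine as a whole.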

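Edit: It turns out gensim provides some nice utility functions, including one that takes the streamed corpus object format and returns a csc matrix. Here is a full example of how your code could work (hooked up to sklearn's KMeans clustering algorithm):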

    from gensim import corpora, models, matutils
    from sklearn.cluster import KMeans

    # `texts` is the list of tokenized documents from the question.
    texts = [[token for token in text] for text in texts]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    print("TFIDF:")
    # corpus2csc returns terms x documents; transpose to documents x terms.
    corpus_tfidf = matutils.corpus2csc(corpus_tfidf).transpose()
    print(corpus_tfidf)
    print("__________________________________________")

    kmeans = KMeans(n_clusters=2)
    print(kmeans.fit_predict(corpus_tfidf))

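You should compute and pass the additional arguments that corpus2csc accepts, since doing so can save you cycles depending on the size of your corpus. We transpose the matrix because gensim puts documents in columns and terms in rows. Depending on your use case (beyond KMeans clustering), you can convert the scipy sparse matrix to a myriad of other types.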

python

k-means

gensim

scikit-learn