Kmeans: Terms occurring in more than one cluster?

Question

When using Kmeans with a TF-IDF vectorizer, is it possible to get terms that occur in more than one cluster?

Here is the example dataset:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

I use a TF-IDF vectorizer for feature extraction:

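from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# For each centroid, sort the term indices by descending weight.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Vocabulary terms, indexed by feature column (scikit-learn >= 1.0).
terms = vectorizer.get_feature_names_out()
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end=' ')
    print()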

When I cluster the documents with KMeans from scikit-learn, the result is as follows:

Top terms per cluster:
Cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
Cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
Cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

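We can see that some terms occur in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).

Is the clustering result wrong? Or is it acceptable, because the tf-idf scores of these terms differ from document to document?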

Answer 1

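I think you are a bit confused about what you are trying to do. The code you are using gives you a clustering of the documents, not of the terms. The terms are the dimensions you are clustering over. Each centroid is a vector with one weight per term, so the same term can rank among the top terms of several clusters.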

If you want to find which cluster each document belongs to, you just need to use the predict or fit_predict method, like this:

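from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# predict returns one cluster label per row (document) of the matrix.
labels = km.predict(feature)
for n, label in enumerate(labels):
    print("Doc %d belongs to cluster %d. " % (n, label))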

You get:

Doc 0 belongs to cluster 2. 
Doc 1 belongs to cluster 1. 
Doc 2 belongs to cluster 2. 
Doc 3 belongs to cluster 2. 
Doc 4 belongs to cluster 1. 
Doc 5 belongs to cluster 0. 
Doc 6 belongs to cluster 0. 
Doc 7 belongs to cluster 0. 
Doc 8 belongs to cluster 1.
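The fit_predict route mentioned above can be sketched in the same way. This is an illustrative variant, not the answerer's code: `docs_by_cluster` is a hypothetical helper name, and `random_state=0` is an assumption added for repeatability. The cluster numbering is arbitrary and may differ from the listing above.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
feature = vectorizer.fit_transform(documents)

# random_state is an assumption added here for repeatability.
km = KMeans(n_clusters=3, init="k-means++", max_iter=100, n_init=1,
            random_state=0)

# fit_predict fits the model and returns one label per document.
labels = km.fit_predict(feature)

# Group the original documents by their assigned cluster label.
docs_by_cluster = defaultdict(list)
for doc, label in zip(documents, labels):
    docs_by_cluster[int(label)].append(doc)

for label in sorted(docs_by_cluster):
    print("Cluster %d:" % label)
    for doc in docs_by_cluster[label]:
        print("   ", doc)
```

Each document lands in exactly one cluster, even though individual terms contribute to several centroids.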

Take a look at the scikit-learn User Guide.

python

cluster-analysis

tf-idf

k-means

scikit-learn