For a given word, predict the cluster and get the nearest words from the cluster

Question

I have trained w2v and k-means on my corpus, following the instructions given in this link:

https://ai.intelligentonlinetools.com/ml/k-means-clustering-example-word2vec/

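What I want to do:
a. Find the cluster ID for a given word.
b. Get the top 20 nearest words to the given word from within its cluster.

I have already figured out how to find the words in a given cluster. What I want is to find the words in that cluster that are closest to my given word.

Any help is appreciated.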

Answer 1

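Your linked guide, with its given data, is a bit misleading. You can't get meaningful 100-dimensional word vectors (the gensim Word2Vec class default) from a corpus of a mere 30 words. The results from such a model will be nonsense, useless for clustering or other downstream steps, so any tutorial that aims to show this process with realistic results should use far more data.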

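If you are in fact using far more data, and have succeeded in clustering the words, the Word2Vec model's most_similar() function will give you the top-N (default 10) closest words to any given input word. (Specifically, they are returned as (word, cosine_similarity) tuples, ranked from highest cosine_similarity down.)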
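To make the shape of that result concrete, here is a minimal pure-Python sketch of ranking words by cosine similarity. It is not gensim's implementation; the helper names `cosine_similarity` and `rank_by_similarity` and the 2-dimensional `toy_vectors` are made up for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(target_word, vectors, topn=10):
    """Return up to topn (word, cosine_similarity) tuples, most similar first.

    The target word itself is left out of the results.
    """
    target_vec = vectors[target_word]
    scored = [(word, cosine_similarity(target_vec, vec))
              for word, vec in vectors.items() if word != target_word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy vectors; real word vectors would come from a trained model.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "bus": [0.2, 0.9],
}
ranked = rank_by_similarity("cat", toy_vectors, topn=3)
print(ranked)  # "dog" ranks first, then "bus", then "car"
```

With a real model you would call most_similar() rather than anything like this; the sketch only shows why the output is a ranked list of (word, score) pairs.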

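The Word2Vec model is of course oblivious to the results of the clustering, so you have to filter those results to discard words outside the cluster of interest.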

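I'll assume you have some lookup object cluster, such that cluster[word] gives you the cluster ID for a specific word. (This might be a dict, or something that runs a KMeans model's predict() on the supplied vector, or whatever.) I'll also assume total_words is the total number of words in your model. (For example: total_words = len(w2v_model.wv).) Then your logic should be roughly like this: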

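# Cluster ID of the word we are querying.
target_cluster = cluster[target_word]
# Rank every other word in the vocabulary by cosine similarity to the target.
all_similars = w2v_model.wv.most_similar(target_word, topn=total_words)
# Keep only the words that fall in the same cluster as the target.
in_cluster_similars = [sim for sim in all_similars
                       if cluster[sim[0]] == target_cluster]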

If you only want the top 20 results, clip to in_cluster_similars[:20].
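As a rough, self-contained sketch of the same idea, the helpers below build a dict-style lookup and filter an already ranked list, stopping once enough in-cluster words are found. The function names `build_cluster_lookup` and `nearest_in_cluster` and the toy data are hypothetical. With gensim 4.x and scikit-learn, one plausible way to get the inputs would be w2v_model.wv.index_to_key for the words and the fitted KMeans model's labels_ for the labels, assuming KMeans was fitted on w2v_model.wv.vectors so that the two line up row by row.

```python
def build_cluster_lookup(words, labels):
    """Map each word to its cluster ID; words[i] must correspond to labels[i]."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return dict(zip(words, labels))

def nearest_in_cluster(target_word, ranked_similars, cluster, topn=20):
    """Keep the first topn ranked words that share target_word's cluster.

    ranked_similars is a list of (word, similarity) tuples, best first,
    such as the list most_similar() returns.
    """
    target_cluster = cluster[target_word]
    kept = []
    for word, similarity in ranked_similars:
        if len(kept) >= topn:
            break  # the input is already ranked, so we can stop early
        if cluster[word] == target_cluster:
            kept.append((word, similarity))
    return kept

# Hypothetical toy data standing in for real model output.
words = ["cat", "dog", "car", "bus", "kitten"]
labels = [0, 0, 1, 1, 0]
cluster = build_cluster_lookup(words, labels)
ranked_similars = [("kitten", 0.95), ("car", 0.60), ("dog", 0.55), ("bus", 0.10)]

result = nearest_in_cluster("cat", ranked_similars, cluster, topn=20)
print(result)  # [('kitten', 0.95), ('dog', 0.55)]
```

Stopping early only saves the filtering work; the full ranking over the vocabulary still has to be requested, because there is no way to know in advance how far down the list the 20th in-cluster word sits.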

cluster-analysis

k-means

python-3.x

supervised-learning

word2vec