How to plot t-SNE on word2vec (created with gensim) for only the 20 most_similar words?
I am using t-SNE to plot a trained word2vec model (created with gensim):
labels = []
tokens = []
# gensim < 4.0 exposes the vocabulary as model.wv.vocab;
# in gensim >= 4.0 iterate over model.wv.index_to_key instead.
for word in model.wv.vocab:
    tokens.append(model.wv[word])  # model[word] is deprecated; index model.wv instead
    labels.append(word)
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
new_values = tsne_model.fit_transform(tokens)
x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])
plt.figure(figsize=(50, 50))
for i in range(len(x)):
    plt.scatter(x[i], y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
plt.show()
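gensim has a built-in method, most_similar. For example,
w2v_model.wv.most_similar(positive=['word'], topn=20)
outputs the 20 words most similar to 'word'. In the same way, I would like to plot only the words most similar to a given word (n=20). Any suggestions on how to modify the plot to do this?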
Using the example from the package:
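from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
model = Word2Vec(sentences=common_texts, window=5, min_count=1)
# gensim < 4.0; in gensim >= 4.0 use list(model.wv.index_to_key)
labels = list(model.wv.vocab)
tokens = model.wv[labels]
# common_texts has only 12 distinct words, and recent scikit-learn
# versions reject perplexity >= n_samples, so set a small value explicitly
tsne_model = TSNE(perplexity=5, init='pca', learning_rate='auto')
new_values = tsne_model.fit_transform(tokens)
# coordinates used by the plotting loops
x, y = new_values[:, 0], new_values[:, 1]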
The t-SNE plot looks like this:
plt.figure(figsize=(7, 5))
for i in range(new_values.shape[0]):
    plt.scatter(x[i], y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
Extract the words most similar to 'trees' (5 in my case):
most_sim_words = [i[0] for i in model.wv.most_similar(positive='trees', topn=5)]
most_sim_words
['human', 'graph', 'time', 'interface', 'system']
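For intuition, most_similar ranks the vocabulary by cosine similarity to the query word's vector. A rough pure-Python sketch of that ranking is below. The helper names (cosine, top_n_similar) and the toy vectors are made up for illustration; gensim's own implementation is vectorised and differs in detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(vectors, query, topn=5):
    """Return the topn (word, similarity) pairs closest to `query`, excluding the query itself."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy 2-D vectors, invented for illustration (not taken from a trained model).
toy_vectors = {
    "trees": [1.0, 0.0],
    "graph": [0.9, 0.1],
    "minors": [0.5, 0.5],
    "user": [0.0, 1.0],
}
nearest = [w for w, _ in top_n_similar(toy_vectors, "trees", topn=2)]
print(nearest)
```

On a model trained on a corpus as tiny as common_texts the neighbours are essentially noise, so the exact list you get will likely vary from run to run.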
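You can reuse your existing code: just iterate over the most similar words and use index() to get each word's position in labels (which matches its row in tokens and new_values):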
for word in most_sim_words:
    i = labels.index(word)
    plt.scatter(x[i], y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
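If the vocabulary is large, calling labels.index() inside the loop scans the whole list for every word. One possible alternative is to build a word-to-row dictionary once and look the words up in it. The sketch below uses a hypothetical helper, select_points, and toy coordinates rather than real t-SNE output.

```python
def select_points(labels, coords, words):
    """Return (word, x, y) for each requested word that is present in labels."""
    # Build the lookup once; each later lookup is O(1) instead of a list scan.
    position = {label: i for i, label in enumerate(labels)}
    selected = []
    for word in words:
        i = position.get(word)
        if i is None:
            continue  # skip words that were never embedded
        selected.append((word, coords[i][0], coords[i][1]))
    return selected

# Toy data for illustration only.
labels = ["trees", "graph", "user"]
coords = [(0.0, 1.0), (0.5, 1.5), (-2.0, 0.3)]
points = select_points(labels, coords, ["trees", "graph", "missing"])
print(points)
```

With the real data you would pass labels, new_values and something like ['trees'] + most_sim_words (so the query word shows up too), then call plt.scatter and plt.annotate on each triple as in the loop above.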