Mapping word vector to the most similar/closest word using spaCy

Question

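I am using spaCy as part of a topic modelling solution, and I have a situation where I need to map a derived word vector to the "closest" or "most similar" word in a vocabulary of word vectors.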

I see that gensim has a function (WordEmbeddingsKeyedVectors.similar_by_vector) to calculate this, but I was wondering whether spaCy has something similar for mapping a vector to a word in its vocabulary (nlp.vocab)?

Answer 1

After a bit of experimentation, I found a SciPy function (cdist in scipy.spatial.distance) that finds a "close" vector to the input vector within a vector space.

# Imports
import numpy as np
from scipy.spatial import distance
import spacy

# Load the spacy vocabulary
nlp = spacy.load("en_core_web_lg")

# Format the input vector for use in the distance function
# In this case we will artificially create a word vector from a real word ("frog")
# but any derived word vector could be used
input_word = "frog"
p = np.array([nlp.vocab[input_word].vector])

# Format the vocabulary for use in the distance function
ids = [x for x in nlp.vocab.vectors.keys()]
vectors = [nlp.vocab.vectors[x] for x in ids]
vectors = np.array(vectors)

# *** Find the closest word below ***
closest_index = distance.cdist(p, vectors).argmin()
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
# output_word is identical, or very close, to the input word
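
The same lookup can be tried without downloading a model. The following is a minimal sketch that assumes a made-up four-word vocabulary with 3-dimensional vectors; the helper name closest_word and the toy values are illustrative and not part of spaCy or SciPy.

```python
import numpy as np
from scipy.spatial import distance

# Toy vocabulary: made-up 3-d vectors standing in for real word vectors
words = ["frog", "toad", "car", "truck"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.8],
    [0.1, 0.8, 0.9],
])

def closest_word(query, words, vectors, metric="euclidean"):
    """Return the vocabulary word whose vector is nearest to `query`."""
    # cdist expects 2-d inputs, so a single vector is promoted to shape (1, d)
    dists = distance.cdist(np.atleast_2d(query), vectors, metric=metric)
    return words[int(dists.argmin())]

# A "derived" vector: a weighted mix of two word vectors
derived = 0.7 * vectors[2] + 0.3 * vectors[3]
closest = closest_word(derived, words, vectors)
print(closest)
```

Passing metric="cosine" to the helper switches the ranking from Euclidean distance to cosine distance.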

Answer 2

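A word of caution on this answer. Traditionally, word similarity (in gensim, spacy, and nltk) uses cosine similarity, while SciPy's cdist uses Euclidean distance by default. You can get the cosine distance instead, which is not the same as similarity, but the two are related. To replicate gensim's calculation, change your cdist call to the following: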

distance.cdist(p, vectors, metric='cosine').argmin()

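However, you should also note that SciPy measures cosine distance, which is "backwards" from cosine similarity: "cosine dist" = 1 - cos x (x is the angle between the vectors). So to match/duplicate the gensim numbers you must subtract your answer from one (and, of course, take the MAX argument, since similar vectors are closer to 1). This is a very subtle difference, but it can cause a lot of confusion.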

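Similar vectors should have a large similarity (near 1) and a small distance (near zero).

Cosine similarity can be negative (meaning the vectors point in opposite directions), but their DISTANCE stays non-negative (as a distance should be).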
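
To make the relationship concrete, here is a small sketch with three hand-picked 2-d vectors (the values are made up for illustration). It shows that similarity = 1 - distance, and that argmin over cosine distance selects the same row as argmax over cosine similarity.

```python
import numpy as np
from scipy.spatial import distance

p = np.array([[1.0, 0.0]])
vectors = np.array([
    [2.0, 0.0],    # same direction as p
    [0.0, 3.0],    # orthogonal to p
    [-1.0, 0.0],   # opposite direction to p
])

# SciPy returns cosine DISTANCE = 1 - cos(angle)
cos_dist = distance.cdist(p, vectors, metric="cosine")[0]
# Convert back to cosine SIMILARITY
cos_sim = 1.0 - cos_dist

best_by_distance = int(cos_dist.argmin())
best_by_similarity = int(cos_sim.argmax())
print(cos_dist, cos_sim, best_by_distance, best_by_similarity)
```

The opposite-direction vector has a negative similarity but a distance of about 2, so the distance never goes below zero.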

Sources: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.n_similarity.html#gensim.models.Word2Vec.n_similarity

Also, similarity in spacy is computed as follows:

import spacy
nlp = spacy.load("en_core_web_md")
x = nlp("man")
y = nlp("king")
print(x.similarity(y))
print(x.similarity(x))

Answer 3

Yes, spacy has an API method to do that, just like KeyedVectors.similar_by_vector:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

your_word = "king"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # note: most_similar returns (keys, best_rows, scores); these are similarity scores
print(words)
['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings', 'PRINCE', 'Prince', 'prince']

(The words are not properly normalized in en_core_web_lg, but you could play with other models and observe a more representative output.)
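
If the case variants in the output are unwanted, one possible post-processing step is to lowercase the results and drop duplicates while keeping the ranking order. This is only a sketch: the helper dedupe_case_insensitive is hypothetical and not part of spaCy, and the input is the word list printed above.

```python
def dedupe_case_insensitive(words):
    """Lowercase each word and keep only its first occurrence, preserving order."""
    seen = set()
    unique = []
    for w in words:
        key = w.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

raw = ['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings',
       'PRINCE', 'Prince', 'prince']
normalized = dedupe_case_insensitive(raw)
print(normalized)  # ['king', 'kings', 'prince']
```

Because several of the top hits collapse into one entry, it may be worth requesting a larger n from most_similar before deduplicating.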

Answer 4

Here is an example of similarity search with feature vectors of dimension 300 (1.2 kB each as 32-bit floats).

You can store the word vectors in a geometric data structure, sklearn.neighbors.BallTree, to speed up the search significantly while avoiding the high-dimensional losses associated with k-d trees (no speedup when the dimension exceeds ~100). These can be pickled and unpickled easily and held in memory if you need to avoid loading spaCy. See below for implementation details.

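The other answers, which use linear search, work (I would note that you should be careful with cosine similarity if any of your vectors are zero), but they will be slow for large vocabularies. spaCy's en_core_web_lg model has about 680k words with word vectors. Even though each word is only a few bytes, at about 1.2 kB per vector the vectors alone come to roughly 0.8 GB, so a search can end up using a few GB of memory.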
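
The zero-vector caveat comes from the division by the vector norms in the cosine formula. Below is a minimal NumPy sketch of one way to guard against it; the function name cosine_similarities and the choice to score zero vectors as -inf are assumptions made for illustration, not behaviour of spaCy or SciPy.

```python
import numpy as np

def cosine_similarities(query, vectors):
    """Cosine similarity of `query` against each row of `vectors`.

    Rows with zero norm (and a zero query) get -inf instead of NaN,
    so they can never win an argmax.
    """
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    q_norm = np.linalg.norm(query)
    v_norms = np.linalg.norm(vectors, axis=1)
    sims = np.full(len(vectors), -np.inf)
    valid = v_norms > 0
    if q_norm > 0:
        sims[valid] = vectors[valid] @ query / (v_norms[valid] * q_norm)
    return sims

vectors = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
sims = cosine_similarities([1.0, 1.0], vectors)
best = int(sims.argmax())
print(sims, best)
```

With a zero query every score is -inf, which signals that no meaningful neighbour exists.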

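We can make the search case-insensitive and remove infrequent words using a word frequency table (as of v3.0, the table has to be loaded separately) to trim the vocabulary down to ~100k words. However, the search is still linear and may take a couple of seconds, which may not be acceptable.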
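
The trimming rule can be sketched on its own, without spaCy. The log-probabilities below are invented for illustration; real values would come from the frequency table, and trim_vocab is a hypothetical helper name.

```python
# Hypothetical log-probabilities; real values would come from a frequency table
log_prob = {
    "the": -3.5,
    "The": -5.0,
    "frog": -12.0,
    "FROG": -17.5,
    "zyzzyva": -19.8,
    "NASA": -11.0,
}

def trim_vocab(log_prob, min_prob=-18.0):
    """Keep lowercase words whose log-probability is at least `min_prob`."""
    return sorted(w for w, p in log_prob.items() if w.islower() and p >= min_prob)

kept = trim_vocab(log_prob)
print(kept)  # ['frog', 'the']
```

Note that a plain islower() filter also drops words that only occur capitalized, such as "NASA" here, so this trades coverage for a smaller vocabulary.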

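There are libraries that can do fast similarity search, but they can be quite cumbersome and complicated to install, and they are geared toward feature vectors on the order of MBs or GBs, with GPU acceleration and the like.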

We also may not want to load the entire spaCy vocabulary every time the application runs, so we pickle/unpickle the vocabulary as needed.

import spacy, numpy, pickle
import sklearn.neighbors as nbs
from spacy.lookups import load_lookups

#load spaCy
nlp=spacy.load("en_core_web_lg")

#load lexeme probability table (requires the spacy-lookups-data package)
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))

#get lowercase words above frequency threshold with vectors, min_prob=-18
vocab = [word for word in nlp.vocab.strings if nlp.vocab.has_vector(word) and word.islower() and nlp.vocab[word].prob >= -18]
wordvecs = numpy.array([nlp.vocab.get_vector(word) for word in vocab])  #get wordvectors
tree = nbs.BallTree(wordvecs)  #create the balltree
to_vec = dict(zip(vocab,wordvecs))  #create word:vector dict (named to_vec so the built-in dict is not shadowed)

After trimming the vocabulary, we can pickle the word list, the dict, and the ball tree and load them when needed without having to load spaCy again:

#pickle/unpickle the balltree if you don't want to load spaCy
with open('balltree.pkl', 'wb') as f:
    pickle.dump(tree,f,protocol=pickle.HIGHEST_PROTOCOL)
#...
#load wordvector balltree from pickle file
with open('./balltree.pkl','rb') as f:
    tree = pickle.load(f)

Given a word, get its word vector, search the tree for the indices of the nearest words, then look those words up by index:

#get wordvector and lookup nearest words
def nearest_words(word):
    #get the vector for the word
    try:
        vec = to_vec[word]
    #if word is not in vocab, set to zero vector
    except KeyError:
        vec = numpy.zeros(300)

    #perform nearest neighbor search of wordvector vocabulary
    dist, ind = tree.query([vec],10)

    #lookup nearest words using indices from tree
    near_words = [vocab[i] for i in ind[0]]

    return near_words
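
The whole flow (build the tree, pickle it, restore it, query it) can be tried on toy data without loading spaCy. This sketch assumes a made-up four-word vocabulary with 3-dimensional vectors, and pickles to an in-memory bytes object instead of a file. BallTree uses Euclidean distance by default, so the ranking here is by Euclidean distance rather than cosine similarity.

```python
import pickle
import numpy as np
from sklearn.neighbors import BallTree

# Toy vocabulary: made-up vectors standing in for real word vectors
words = ["frog", "toad", "car", "truck"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.8],
    [0.1, 0.8, 0.9],
])
to_vec = dict(zip(words, vectors))

# Build the tree, then round-trip it through pickle
tree = BallTree(vectors)
restored = pickle.loads(pickle.dumps(tree, protocol=pickle.HIGHEST_PROTOCOL))

def nearest_words(word, k=2):
    """Return the k nearest vocabulary words to `word`, closest first."""
    vec = to_vec[word]  # raises KeyError for unknown words in this sketch
    dist, ind = restored.query([vec], k=k)
    return [words[i] for i in ind[0]]

neighbors = nearest_words("frog")
print(neighbors)  # ['frog', 'toad']
```

The query word itself comes back first with distance zero, so ask for k + 1 neighbours and skip the first if the word itself should be excluded.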

Answer 5

# python -m spacy download en_core_web_md
import spacy
nlp = spacy.load('en_core_web_md')
word = 'overflow'
nwords = 10
doc = nlp(word)
vector = doc.vector
vect2word = lambda idx: nlp.vocab.strings[idx]
print([vect2word(simword) for simword in nlp.vocab.vectors.most_similar(vector.reshape(1,-1), n=nwords)[0][0]])
