Using annoy with Torchtext for nearest neighbor search

I'm using Torchtext for some NLP tasks, specifically the built-in embeddings.

I want to be able to do a reverse vector search: generate a noisy vector, find the vector that is closest to it, then get back the word that is "closest" to the noisy vector.

From the torchtext docs, here's how to attach embeddings to a built-in dataset:

from torchtext.vocab import GloVe
from torchtext import data, datasets

embedding = GloVe(name='6B', dim=100)

# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
LABEL.build_vocab(train)

# Get an example vector
embedding.get_vecs_by_tokens("germany")
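
As a quick sanity check (a sketch, assuming the snippet above has run), TEXT.vocab now exposes the word list (itos), the reverse lookup (stoi) and the vector matrix, so we can see how they line up:

# The vocabulary size and the first dimension of the vector matrix should agree
print(len(TEXT.vocab.itos))        # number of words, including specials such as <unk> and <pad>
print(TEXT.vocab.vectors.shape)    # torch.Size([vocab_size, 100])
print(TEXT.vocab.stoi["germany"])  # integer index of "germany" in the vocabulary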

We can then build the annoy index:

from annoy import AnnoyIndex

num_trees = 50
embedding_dims = TEXT.vocab.vectors.shape[1]  # 100 for the GloVe 6B 100-d vectors

ann_index = AnnoyIndex(embedding_dims, 'angular')

# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
    ann_index.add_item(vector_num, vector) # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?

ann_index.build(num_trees)
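
To make the concern in the comment above concrete, here is a small round-trip check (a sketch, using only the names defined so far): look up the vector of a known word through the vocab, query annoy with it, and see whether the returned item index maps back to the same word through itos. Converting the tensor with .tolist() avoids relying on annoy iterating a torch tensor element by element.

test_word = "germany"
test_vec = TEXT.vocab.vectors[TEXT.vocab.stoi[test_word]]
# Query the index with a known vector and map the nearest item back to a word
nearest_idx = ann_index.get_nns_by_vector(test_vec.tolist(), 1)[0]
print(TEXT.vocab.itos[nearest_idx])  # expected to print "germany"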

Then suppose I want to retrieve a word using the noisy vector:

# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector, 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]
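
generate_noise_vector is not defined above; a minimal sketch of what it could look like, assuming small zero-mean Gaussian noise is what is wanted (the scale parameter is my own addition):

import torch

def generate_noise_vector(ndims=100, scale=0.1):
    # Hypothetical helper: zero-mean Gaussian noise scaled down so the
    # perturbed vector stays close to the original embedding.
    return torch.randn(ndims) * scale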

My problem lies in the last two lines of that retrieval snippet: the ann_index was built by enumerating over the embedding object, which is a Torch tensor.

The vocab object has its own itos list that, given an index, returns a word.

My question is: can I be certain that the order in which words appear in the itos list is the same as the order in TEXT.vocab.vectors? How do I map one index to the other?

Can I be certain that the order in which words appear in the itos list is the same as the order in TEXT.vocab.vectors?

Yes.

The Field class will always instantiate a Vocab object (source), and since you are passing the pre-trained vectors to TEXT.build_vocab, the Vocab constructor will call the load_vectors function:

if vectors is not None:
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)

Inside load_vectors, the vectors are filled by enumerating the words in itos:

for i, token in enumerate(self.itos):
    start_dim = 0
    for v in vectors:
        end_dim = start_dim + v.dim
        self.vectors[i][start_dim:end_dim] = v[token.strip()]
        start_dim = end_dim
    assert(start_dim == tot_dim)

Therefore, you can be certain that itos and vectors share the same order.
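
As an empirical confirmation (a sketch reusing the names from the question), the row of TEXT.vocab.vectors at a word's stoi index should match the vector GloVe returns for that word, so an annoy item index can be used directly with itos:

import torch

word = "germany"
idx = TEXT.vocab.stoi[word]
# The row built by load_vectors should equal the raw GloVe lookup for the same word
assert torch.allclose(TEXT.vocab.vectors[idx], embedding.get_vecs_by_tokens(word))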