Doc2Vec: get the most similar documents
I am trying to build a document retrieval model that returns documents ranked by their relevance to a query or search string. To do this I trained a doc2vec model using the Doc2Vec class from gensim. My dataset is a pandas DataFrame with each document stored as a string on its own row. This is the code I have so far:
import re
import multiprocessing
import gensim
import pandas as pd
from gensim.models.doc2vec import TaggedDocument  # TaggedDocument replaced the older LabeledSentence

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)

sentences = []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(TaggedDocument(line, [item_no]))
# MODEL PARAMETERS
dm = 1                # 1 for distributed memory (default); 0 for DBOW
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(documents=sentences,
                dm=dm,
                alpha=alpha,            # initial learning rate
                seed=seed,
                min_count=min_count,    # ignore words with freq less than min_count
                max_vocab_size=None,    # no cap on vocabulary size
                window=context_window,  # number of words before and after the target used as context
                size=size,              # dimensionality of the feature vectors
                sample=1e-4,            # threshold for downsampling very frequent words
                negative=5,             # number of "noise words" drawn for negative sampling
                workers=cores,          # number of worker threads
                iter=max_iter)          # number of iterations (epochs) over the corpus
# QUERY BASED DOC RANKING ??
The part I am stuck on is finding the documents most similar/relevant to a query. I used infer_vector but then realized that it treats the query as a document, updates the model, and returns the results. I tried the most_similar and most_similar_cosmul methods, but I get back words along with similarity scores (I guess). What I want is: when I enter a search string (a query), I should get the most relevant documents (their ids) along with a similarity score (cosine etc.). How do I get this part done?
You need to use infer_vector to get a document vector for the new text; this does not change the underlying model. Here is how you do it:
tokens = "a new sentence to match".split()
new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector])  # gives the top 10 document tags and their cosine similarity
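The tags returned by most_similar are the same integer ids used when building the corpus, so they index straight back into the rows of the DataFrame. For illustration, here is a minimal standalone sketch of the ranking step itself, using plain NumPy cosine similarity in place of gensim's most_similar; the names rank_documents, doc_vectors, and query_vec are hypothetical, not part of the gensim API:

```python
import numpy as np

def rank_documents(query_vec, doc_vectors, top_n=10):
    """Return (doc_id, cosine_similarity) pairs, best match first."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # cosine similarity between the query and every stored document vector
    sims = doc_vectors @ query_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    # indices of the top_n highest similarities, in descending order
    order = np.argsort(-sims)[:top_n]
    return [(int(i), float(sims[i])) for i in order]
```

With a trained model you would pass model.infer_vector(tokens) as the query vector and the stored document vectors as the matrix; the returned ids can then be used with data.iloc[...] to recover the original text.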
EDIT:
Below is an example showing that the underlying model is unchanged after calling infer_vector.
import numpy as np

words = "king queen man".split()
len_before = len(model.docvecs)  # number of docs

# word vectors for king, queen, man
w_vec0 = model[words[0]]
w_vec1 = model[words[1]]
w_vec2 = model[words[2]]

new_vec = model.infer_vector(words)
len_after = len(model.docvecs)

print(np.array_equal(model[words[0]], w_vec0))  # True
print(np.array_equal(model[words[1]], w_vec1))  # True
print(np.array_equal(model[words[2]], w_vec2))  # True
print(len_before == len_after)  # True