GenSim Word2Vec unexpectedly pruning
My objective is to find vector representations of phrases. Below is my code, which partially works for bigrams using the Word2Vec model provided by the GenSim library.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    # Detect bigrams using the default Phrases settings
    bigrams = Phrases(unigrams)
    # Train Word2Vec on the corpus with detected bigrams merged into single tokens
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, window=4, sg=1, hs=1, negative=0, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return None
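The problem is that the Word2Vec model seems to be automatically pruning some of the bigrams, i.e. len(model.vocab.keys()) != len(bigrams.vocab.keys()). I have tried adjusting various parameters such as trim_rule and min_count, but they do not seem to affect the pruning.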
PS - I know that the bigram to look up needs to be written with an underscore instead of a space, i.e. the correct way to call my function is bigram2vec(unigrams, 'this_report').
Thanks to the GenSim support forum, the solution is to set appropriate min_count and threshold values for the Phrases being generated (see the documentation of the Phrases class for a detailed explanation of these parameters). The corrected code is below.
from gensim.models import word2vec, Phrases

def bigram2vec(unigrams, bigram_to_search):
    bigrams = Phrases(unigrams, min_count=1, threshold=0.1)
    model = word2vec.Word2Vec(sentences=bigrams[unigrams], size=20, min_count=1, trim_rule=None)
    if bigram_to_search in model.vocab.keys():
        return model[bigram_to_search]
    else:
        return []
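To see why the Phrases parameters matter rather than the Word2Vec ones, here is a rough pure-Python sketch of the kind of scoring the Phrases documentation describes. It assumes the default Mikolov-style score, (count(a_b) - min_count) * vocab_size / (count(a) * count(b)), with a bigram accepted only when the score exceeds threshold. The helper names count_vocab, phrase_score and accepted_bigrams and the toy corpus are hypothetical, and the sketch ignores details such as the greedy left-to-right merging, so treat it as an illustration and not as the library's implementation.

```python
from collections import Counter


def count_vocab(sentences):
    """Count every unigram and every adjacent pair (joined with '_')."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
        counts.update("_".join(pair) for pair in zip(sentence, sentence[1:]))
    return counts


def phrase_score(counts, word_a, word_b, min_count):
    """Assumed default score: (count(ab) - min_count) * vocab_size / (count(a) * count(b))."""
    bigram_count = counts[word_a + "_" + word_b]
    return (bigram_count - min_count) * len(counts) / (counts[word_a] * counts[word_b])


def accepted_bigrams(sentences, min_count, threshold):
    """Return the set of adjacent pairs whose score exceeds the threshold."""
    counts = count_vocab(sentences)
    accepted = set()
    for sentence in sentences:
        for a, b in zip(sentence, sentence[1:]):
            if phrase_score(counts, a, b, min_count) > threshold:
                accepted.add(a + "_" + b)
    return accepted


# Hypothetical toy corpus, far smaller than anything the defaults are tuned for
toy_corpus = [
    ["this", "report", "is", "short"],
    ["this", "report", "is", "long"],
    ["that", "report", "is", "late"],
]

# Settings matching the Phrases defaults (min_count=5, threshold=10.0)
print(sorted(accepted_bigrams(toy_corpus, min_count=5, threshold=10.0)))  # []

# Settings from the corrected code above
print(sorted(accepted_bigrams(toy_corpus, min_count=1, threshold=0.1)))  # ['report_is', 'this_report']
```

With the default-like settings every pair scores below the threshold, so no bigram token ever reaches Word2Vec and there is nothing for its min_count or trim_rule to keep. Note also that the count table here holds every candidate pair alongside the unigrams; if bigrams.vocab is organized the same way, its length would not be expected to match the Word2Vec vocabulary even when no pruning happens.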