Gensim Doc2Vec trims and deletes the vocabulary
I tried to create a simple Doc2Vec model:
from gensim.models import doc2vec
from gensim.models.doc2vec import Doc2Vec

sentences = []
sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'rosse', u'con', u'tacco'], tags=[1]))
sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'blu'], tags=[2]))
sentences.append(doc2vec.TaggedDocument(words=[u'scarponcini', u'Emporio', u'Armani'], tags=[3]))
sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'marca', u'italiana'], tags=[4]))
sentences.append(doc2vec.TaggedDocument(words=[u'scarpe', u'bianche', u'senza', u'tacco'], tags=[5]))
model = Doc2Vec(alpha=0.025, min_alpha=0.025) # use fixed learning rate
model.build_vocab(sentences)
But I end up with an empty vocabulary. With some debugging I saw that inside build_vocab() the dictionary is in fact created by vocabulary.scan_vocab(), but it is then emptied by the following vocabulary.prepare_vocab() call. Digging deeper, this is the function that causes the problem:
def keep_vocab_item(word, count, min_count, trim_rule=None):
    """Check that should we keep `word` in vocab or remove.

    Parameters
    ----------
    word : str
        Input word.
    count : int
        Number of times that word contains in corpus.
    min_count : int
        Frequency threshold for `word`.
    trim_rule : function, optional
        Function for trimming entities from vocab, default behaviour is `vocab[w] <= min_reduce`.

    Returns
    -------
    bool
        True if `word` should stay, False otherwise.

    """
    default_res = count >= min_count

    if trim_rule is None:
        return default_res  # <-- ALWAYS RETURNS FALSE
    else:
        rule_res = trim_rule(word, count, min_count)
        if rule_res == RULE_KEEP:
            return True
        elif rule_res == RULE_DISCARD:
            return False
        else:
            return default_res
Does anyone understand this problem?
I found the answer myself: the default value of min_count is 5, and none of my words occurs 5 or more times.
I just had to change this line:
model = Doc2Vec(min_count=0, alpha=0.025, min_alpha=0.025)
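To see why the default silently drops everything, here is a self-contained sketch of keep_vocab_item's decision logic (the RULE_* constants are assumed to match gensim.utils, where RULE_DEFAULT=0, RULE_DISCARD=1, RULE_KEEP=2):

```python
# Standalone illustration of the keep_vocab_item() logic quoted above.
# RULE_* values assumed to mirror gensim.utils.
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

def keep_vocab_item(word, count, min_count, trim_rule=None):
    default_res = count >= min_count
    if trim_rule is None:
        return default_res
    rule_res = trim_rule(word, count, min_count)
    if rule_res == RULE_KEEP:
        return True
    elif rule_res == RULE_DISCARD:
        return False
    return default_res

# With the default min_count=5, every word in the tiny corpus fails the
# threshold ('scarpe' occurs only 4 times), so the vocabulary comes out empty:
print(keep_vocab_item(u'scarpe', 4, 5))   # False
# Lowering min_count (the fix above) keeps the words:
print(keep_vocab_item(u'scarpe', 4, 1))   # True
# Alternatively, a trim_rule can override min_count per word:
keep_all = lambda word, count, min_count: RULE_KEEP
print(keep_vocab_item(u'rosse', 1, 5, trim_rule=keep_all))  # True
```

A trim_rule can also be passed directly to the Doc2Vec constructor if you want per-word control instead of a blanket threshold.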