Word2vec gensim - Calculating similarity between words isn't working when using phrases
I'm using gensim's word2vec model to calculate the similarity between two words. Training the model on 250MB of Wikipedia text produced good results: similarity scores of roughly 0.7-0.8 for pairs of related words.
The problem is that when I use the Phraser model to add phrases, the similarity scores for exactly the same word pairs drop to nearly zero.
Results with the phraser model:
speed - velocity - 0.0203503432178
high - low - -0.0435703782446
tall - high - -0.0076987978333
nice - good - 0.0368784716958
computer - computational - 0.00487748035808
This probably means I'm not using the Phraser model correctly.
My code:
data_set_location = **
sentences = SentenceIterator(data_set_location)

# Train the phrase-detection model
self.phraser = Phraser(Phrases(sentences))

# Renewing the iterator because it's empty
sentences = SentenceIterator(data_set_location)

# Train the word2vec model, or load it from disk
self.model = Word2Vec(self.phraser[sentences], size=256, min_count=10, workers=10)
class SentenceIterator(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), 'r', encoding='utf-8', errors='ignore'):
                yield line.lower().split()
Trying the phraser model on its own, it seems to work fine:
>>> vectorizer.phraser['new', 'york', 'city', 'the', 'san', 'francisco']
['new_york', 'city', 'the', 'san_francisco']
What could cause this behavior?
Trying to figure out a solution:
Following gojomo's answer, I tried creating a PhraseIterator:
import os

class PhraseIterator(object):
    def __init__(self, dirname, phraser):
        self.dirname = dirname
        self.phraser = phraser

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), 'r', encoding='utf-8', errors='ignore'):
                yield self.phraser[line.lower()]
I tried training my Word2vec model with this iterator:
phrase_iterator = PhraseIterator(text_dir, self.phraser)
self.model = Word2Vec(phrase_iterator, size=256, min_count=10, workers=10)
The Word2vec training log:
Using TensorFlow backend.
2017-06-30 19:19:05,388 : INFO : collecting all words and their counts
2017-06-30 19:19:05,456 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-30 19:20:30,787 : INFO : collected 6227763 word types from a corpus of 28508701 words (unigram + bigrams) and 84 sentences
2017-06-30 19:20:30,793 : INFO : using 6227763 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-30 19:20:30,793 : INFO : source_vocab length 6227763
2017-06-30 19:21:46,573 : INFO : Phraser added 50000 phrasegrams
2017-06-30 19:22:22,015 : INFO : Phraser built with 70065 70065 phrasegrams
2017-06-30 19:22:23,089 : INFO : saving Phraser object under **/Models/word2vec/phrases_model, separately None
2017-06-30 19:22:23,441 : INFO : saved **/Models/word2vec/phrases_model
2017-06-30 19:22:23,442 : INFO : collecting all words and their counts
2017-06-30 19:22:29,347 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-30 19:33:06,667 : INFO : collected 143 word types from a corpus of 163438509 raw words and 84 sentences
2017-06-30 19:33:06,677 : INFO : Loading a fresh vocabulary
2017-06-30 19:33:06,678 : INFO : min_count=10 retains 95 unique words (66% of original 143, drops 48)
2017-06-30 19:33:06,679 : INFO : min_count=10 leaves 163438412 word corpus (99% of original 163438509, drops 97)
2017-06-30 19:33:06,683 : INFO : deleting the raw counts dictionary of 143 items
2017-06-30 19:33:06,683 : INFO : sample=0.001 downsamples 27 most-common words
2017-06-30 19:33:06,683 : INFO : downsampling leaves estimated 30341972 word corpus (18.6% of prior 163438412)
2017-06-30 19:33:06,684 : INFO : estimated required memory for 95 words and 256 dimensions: 242060 bytes
2017-06-30 19:33:06,685 : INFO : resetting layer weights
2017-06-30 19:33:06,724 : INFO : training model with 10 workers on 95 vocabulary and 256 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-06-30 19:33:14,974 : INFO : PROGRESS: at 0.00% examples, 0 words/s, in_qsize 0, out_qsize 0
2017-06-30 19:33:23,229 : INFO : PROGRESS: at 0.24% examples, 607 words/s, in_qsize 0, out_qsize 0
2017-06-30 19:33:31,445 : INFO : PROGRESS: at 0.48% examples, 810 words/s,
...
2017-06-30 20:19:00,864 : INFO : PROGRESS: at 98.57% examples, 1436 words/s, in_qsize 0, out_qsize 1
2017-06-30 20:19:06,193 : INFO : PROGRESS: at 99.05% examples, 1437 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:11,886 : INFO : PROGRESS: at 99.29% examples, 1437 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:17,648 : INFO : PROGRESS: at 99.52% examples, 1438 words/s, in_qsize 0, out_qsize 0
2017-06-30 20:19:22,870 : INFO : worker thread finished; awaiting finish of 9 more threads
2017-06-30 20:19:22,908 : INFO : worker thread finished; awaiting finish of 8 more threads
2017-06-30 20:19:22,947 : INFO : worker thread finished; awaiting finish of 7 more threads
2017-06-30 20:19:22,947 : INFO : PROGRESS: at 99.76% examples, 1439 words/s, in_qsize 0, out_qsize 8
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 6 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 5 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 4 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-30 20:19:22,948 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-30 20:19:22,949 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-30 20:19:22,949 : INFO : training on 817192545 raw words (4004752 effective words) took 2776.2s, 1443 effective words/s
2017-06-30 20:19:22,950 : INFO : saving Word2Vec object under **/Models/word2vec/word2vec_model, separately None
2017-06-30 20:19:22,951 : INFO : not storing attribute syn0norm
2017-06-30 20:19:22,951 : INFO : not storing attribute cum_table
2017-06-30 20:19:22,958 : INFO : saved **/Models/word2vec/word2vec_model
After this training, both similarity calculations produced zero:
speed - velocity - 0
high - low - 0
So it seemed the iterator wasn't working properly, and I checked it with gojomo's trick:
print(sum(1 for _ in s))
1
print(sum(1 for _ in s))
1
It works.
So what could the problem be?
First, if your iterable class works right (and it looks OK to me), you shouldn't need to "renew the iterator because it's empty". Rather, it will be capable of being iterated over multiple times. You can test whether it works properly as an iterable object, rather than as a mere single iteration, with code like:
sentences = SentencesIterator(mypath)
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
If the same length is printed twice, congratulations, you have a true iterable object. (You may want to update the class name to reflect that.) If the second length is 0, you have a mere iterator: it can be used once, then is empty on subsequent attempts. (If so, adjust the class code so that every call to __iter__() starts fresh. But as noted above, I believe your code is already correct.)
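That iterator-vs-iterable distinction can be demonstrated with plain Python, no gensim required; the two-sentence corpus below is a made-up stand-in for the real sentence stream:

```python
class RestartableCorpus(object):
    """A true iterable: every __iter__() call starts a fresh pass."""
    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield [word.lower() for word in sentence.split()]

corpus = RestartableCorpus(["New York City", "San Francisco"])
print(sum(1 for _ in corpus))    # 2
print(sum(1 for _ in corpus))    # 2 -- restartable

one_shot = iter(corpus)          # a mere iterator
print(sum(1 for _ in one_shot))  # 2
print(sum(1 for _ in one_shot))  # 0 -- exhausted after one pass
```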
This digression matters because the real cause of your problem is that self.phraser[sentences] just returns a single-use iterator object, not a re-iterable iterable object. So Word2Vec's first vocabulary-discovery step consumes the entire corpus in its single pass, and then all the training passes see nothing at all, and no training occurs. (If you have INFO-level logging enabled, this should be evident in the output, showing instant training with no examples.)
Try making a PhraserIterable class which takes a phraser and a sentences, and on every call to __iter__() starts a new, fresh pass over the phrase-combined corpus. Supply an instance of that (confirmed-restartable) class as the corpus to Word2Vec. You should see the training take longer, since it makes 5 passes by default, and then see real results from subsequent token comparisons.
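A minimal sketch of such a class (the name PhraserIterable and the stub below are hypothetical; it only assumes the phraser supports per-sentence __getitem__, as gensim's Phraser does):

```python
class PhraserIterable(object):
    """Re-iterable corpus wrapper: applies the phraser anew on every pass."""
    def __init__(self, phraser, sentences):
        self.phraser = phraser
        self.sentences = sentences  # must itself be a restartable iterable

    def __iter__(self):
        for sentence in self.sentences:
            yield self.phraser[sentence]

# Stand-in for gensim's Phraser, for illustration only.
class StubPhraser(object):
    def __getitem__(self, tokens):
        return tokens  # a real Phraser would merge detected bigrams here

corpus = PhraserIterable(StubPhraser(), [['new', 'york'], ['san', 'francisco']])
print(sum(1 for _ in corpus))  # 2
print(sum(1 for _ in corpus))  # 2 -- restartable, safe to pass to Word2Vec
```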
Separately: composing the raw sentences unigrams into phrase-combined bigrams on the fly can be computationally expensive. The approach suggested above means it happens 6 times: once for the vocabulary scan, then once for each of the 5 training passes. If run-time is a concern, it may be beneficial to perform the phrase-combination once, saving the results either to an in-memory object (if your corpus fits easily in RAM) or to a new single-space-delimited file of interim results, then using that file as the input to the Word2Vec model.
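A sketch of that one-pass preprocessing, again using a hypothetical stub in place of a trained Phraser; gensim can then stream the resulting file line by line (e.g. with its LineSentence helper):

```python
import os
import tempfile

def write_phrased_corpus(phraser, sentences, path):
    """Apply the phraser once, saving space-delimited sentences to disk."""
    with open(path, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(' '.join(phraser[sentence]) + '\n')

# Stand-in for gensim's Phraser, for illustration only.
class StubPhraser(object):
    def __getitem__(self, tokens):
        return ['new_york'] if tokens == ['new', 'york'] else tokens

path = os.path.join(tempfile.mkdtemp(), 'phrased.txt')
write_phrased_corpus(StubPhraser(), [['new', 'york'], ['tall', 'high']], path)
print(open(path, encoding='utf-8').read())
```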
With gojomo's help, here is the working code:
The phrase iterator:
class PhraseIterator(object):
    def __init__(self, phraser, sentences_iterator):
        self.phraser = phraser
        self.sentences_iterator = sentences_iterator

    def __iter__(self):
        yield self.phraser[self.sentences_iterator]
Using this iterator directly produced an error:
unhashable type: 'list'
So I found a workaround, using it like this:
from itertools import chain
phrase_iterator = PhraseIterator(self.phraser, sentences)
self.model = Word2Vec(list(chain(*phrase_iterator)), size=256, min_count=10, workers=10)
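Why this works: __iter__() here yields a single generator (the phraser applied to the whole stream), so chain(*phrase_iterator) unpacks that one generator and list() materializes every phrased sentence in memory, which makes the corpus trivially re-iterable. The same flattening pattern in miniature:

```python
from itertools import chain

def one_shot():
    yield (x * x for x in range(3))  # yields a single inner generator

flattened = list(chain(*one_shot()))
print(flattened)  # [0, 1, 4]
# A plain list, unlike a generator, can be iterated over repeatedly.
print(sum(1 for _ in flattened), sum(1 for _ in flattened))  # 3 3
```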
Now the similarity calculations work well (much better than the earlier results without phrasing):
speed - velocity - 0.950267364305
high - low - 0.933983275802
tall - high - 0.858025875923
nice - good - 0.878882061037
computer - computational - 0.972395648333