Doc2Vec build_vocab method fails
I am following this guide to build a Doc2Vec gensim model.
I have created an MRE to highlight the problem:
import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
def get_words(para):
    # Raw string avoids invalid-escape-sequence warnings for \d and \/
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]
    return stemmed_tokens
data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
1: 'My waiter was awful, my food was awful, I hate it all',
2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
3: 'Even though the cleanliness level was fantastic, my food was awful',
4: 'Everything was mediocre, but I guess mediocre is better than bad nowadays',
5: "Honestly there wasn't a single thing that was mediocre about this place",
6: 'I could not have enjoyed it more! Perfect',
7: 'This place is perfectly awful. I think it should shut down to be honest',
8: "I can't understand how anyone would say something negative",
9: "It killed me. I'm writing this review as a ghost. That's how bad it was."},
'Bogus Field 1': {0: 'foo71',
1: 'foo92',
2: 'foo25',
3: 'foo88',
4: 'foo54',
5: 'foo10',
6: 'foo48',
7: 'foo76',
8: 'foo4',
9: 'foo11'},
'Bogus Field 2': {0: 'foo12',
1: 'foo66',
2: 'foo94',
3: 'foo90',
4: 'foo97',
5: 'foo87',
6: 'foo10',
7: 'foo4',
8: 'foo16',
9: 'foo86'},
'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}
df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']),
tags=x['Sentiment']), axis=1,)
model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])
Which yields:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
926 total_words, corpus_count = self.vocabulary.scan_vocab(
927 documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928 progress_per=progress_per, trim_rule=trim_rule
929 )
930 self.corpus_count = corpus_count
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
1123 documents = TaggedLineDocument(corpus_file)
1124
-> 1125 total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
1126
1127 logger.info(
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
1069 document_length = len(document.words)
1070
-> 1071 for tag in document.tags:
1072 _note_doctag(tag, document_length, docvecs)
1073
TypeError: 'int' object is not iterable
I don't understand where the int type is coming from, given that:
print(set([type(x) for x in train_tagged]))
yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}
Note that other troubleshooting attempts, such as:
train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])],
tags=[x['Sentiment']]), axis=1,)
yield:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
926 total_words, corpus_count = self.vocabulary.scan_vocab(
927 documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928 progress_per=progress_per, trim_rule=trim_rule
929 )
930 self.corpus_count = corpus_count
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
1123 documents = TaggedLineDocument(corpus_file)
1124
-> 1125 total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
1126
1127 logger.info(
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
1073
1074 for word in document.words:
-> 1075 vocab[word] += 1
1076 total_words += len(document.words)
1077
TypeError: unhashable type: 'list'
You aren't passing any documents to your actual trainer; see the part with
model_dbow = Doc2Vec(dm=0, [...])
This 0 is interpreted as an integer, which is why you are getting the error.
Instead, you should simply add your documents as detailed in the gensim docs for Doc2Vec, and you should be fine.
Your first attempt was surely erring in supplying a single value where TaggedDocument instances need a list of values, even if it is a list with only a single item.
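Concretely, the first traceback above ends in `for tag in document.tags:`, so a bare int tag fails exactly the way any plain for loop over an int does. A pure-Python sketch of that failure mode (no gensim needed; the loop below only mirrors the statement shown in the traceback, it is not gensim's actual code):

```python
# tags=x['Sentiment'] stores a bare int, e.g. 1, in the tags field.
# The scan loop then tries to iterate it:
tags = 1
try:
    for tag in tags:  # same shape as "for tag in document.tags:"
        pass
except TypeError as err:
    print(err)  # 'int' object is not iterable

# Wrapping the tag in a list makes the same loop work:
tags = [1]
collected = [tag for tag in tags]
```
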
I'm not sure what went wrong in your second attempt, but have you looked at a representative instance from train_tagged, such as train_tagged[0], to ensure that it is:
- a TaggedDocument whose tags value is a list
- where each item in that list is a simple string (or, in advanced use, an int from a range starting at 0)
Also note that if train_tagged is a proper sequence of TaggedDocument instances, you can and should pass it directly to build_vocab(). (There's no need for the odd [x for x in train_tagged.values] construction.)
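As a minimal sketch of the shape build_vocab() expects from each document, using a namedtuple stand-in (gensim's TaggedDocument is itself a namedtuple of words and tags, so the field layout matches): words should be a flat list of token strings, and tags a list, even with a single tag.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# get_words() already returns a flat list of strings, so don't wrap it
# in another list; wrap only the single tag.
doc = TaggedDocument(words=["food", "good"], tags=[1])

# Both loops from the tracebacks above now succeed:
for tag in doc.tags:        # tags is iterable
    pass
vocab = {}
for word in doc.words:      # each word is a hashable str, usable as a dict key
    vocab[word] = vocab.get(word, 0) + 1
```

With the question's own code, that corresponds to TaggedDocument(words=get_words(x['Review']), tags=[x['Sentiment']]), after which train_tagged can be passed to build_vocab() directly.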
More generally, if you're just starting out with Doc2Vec, you'd be better off beginning with the simpler examples in the Gensim docs than with anything from "Towards Data Science", which hosts a lot of very bad code and misguided practices.