Getting a dimension mismatch error when I try to predict with Naive Bayes / Python
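I created a sentiment script that uses Naive Bayes to classify reviews. I trained and tested my model and saved it in a pickle object. Now I want to run predictions on a new dataset, but I always get the following error message: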
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
It is raised on this line:
preds = nb.predict(transformed_review)[0]
Can anyone tell me what I am doing wrong? I don't understand the error.
Here is my script:
# Imports the script relies on (not shown in the original post; assumed here)
import glob
import os
import pickle
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

sno = SnowballStemmer("german")
stopwords = [word.decode('utf-8-sig') for word in stopwords.words('german')]
ta_review_files = glob.glob('C:/users/Documents/review?*.CSV')
review_akt_doc = max(ta_review_files, key=os.path.getctime)
ta_review = pd.read_csv(review_akt_doc)
sentiment_de_class = ta_review
x = sentiment_de_class['REV']
y = sentiment_de_class['SENTIMENT']

def text_process(text):
    nopunc = [char for char in text.decode('utf8') if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    noDig = ''.join(filter(lambda x: not x.isdigit(), nopunc))
    ## stemming
    stemmi = u''.join(sno.stem(unicode(x)) for x in noDig)
    stop = [word for word in stemmi.split() if word.lower() not in stopwords]
    stop = ' '.join(stop)
    return [word for word in stemmi.split() if word.lower() not in stopwords]
######################
# Matrix
######################
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
x = bow_transformer.transform(x)
######################
# Train and test data
######################
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=101)
print 'starting training ..'
######################
## first use
######################
#nb = MultinomialNB().fit(x_train,y_train)
#file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
## dump information to that file
#pickle.dump(nb, file)
######################
## after train
######################
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
nb = pickle.load(file)
predis = []
######################
# Classify
######################
cols = ['SENTIMENT_CLASSIFY']
for sentiment in sentiment_de_class['REV']:
    transformed_review = bow_transformer.transform([sentiment])
    preds = nb.predict(transformed_review)[0]  ## right here I get the error
    predis.append(preds)
df = pd.DataFrame(predis, columns=cols)
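You need to save the CountVectorizer object as well, just as you save nb.
When you call
CountVectorizer(analyzer=text_process).fit(x)
you are re-fitting the CountVectorizer on the new data, so the features (vocabulary) it finds will differ from those found at training time. The saved nb was trained on the earlier features, so it complains about a dimension mismatch.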
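To make the cause concrete, here is a minimal Python 3 sketch (the question's script is Python 2). The toy reviews and variable names are hypothetical, and it uses the default analyzer rather than text_process. It only illustrates that a re-fitted CountVectorizer produces a different number of columns than the one the classifier was trained with, while reusing the original vectorizer keeps the column count unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the training reviews and the new reviews.
train_reviews = ["gutes hotel", "schlechtes essen", "gutes essen"]
new_reviews = ["sehr gutes zimmer", "schlechtes wlan im zimmer"]

# Vectorizer fitted once, at training time.
train_vectorizer = CountVectorizer().fit(train_reviews)

# Re-fitting on the new data learns a different vocabulary.
refit_vectorizer = CountVectorizer().fit(new_reviews)

# Number of feature columns in each case.
n_train_features = train_vectorizer.transform(train_reviews).shape[1]
n_refit_features = refit_vectorizer.transform(new_reviews).shape[1]
n_reused_features = train_vectorizer.transform(new_reviews).shape[1]

print("columns at training time:     ", n_train_features)
print("columns after re-fitting:     ", n_refit_features)
print("columns when reusing original:", n_reused_features)
```

A model trained on the first matrix expects n_train_features columns, so only the reused vectorizer gives it input of the right width. Words in the new reviews that were not seen at training time are simply ignored.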
It is better to pickle them in separate files, but you can save them in the same file if you prefer.
To pickle both into the same file:
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
pickle.dump(bow_transformer, file)  # <=== add this
pickle.dump(nb, file)
file.close()
To read them back on the next run (in the same order they were dumped):
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
bow_transformer = pickle.load(file)
nb = pickle.load(file)
file.close()
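For the separate-files variant, a rough Python 3 sketch could look like the following. The toy data, the file names vectorizer.pickle and model.pickle, and the temporary directory (standing in for sentiment_MNB_path) are all assumptions made for illustration, and the default analyzer is used instead of text_process.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; in the question these come from the REV and SENTIMENT columns.
train_reviews = ["gutes hotel", "schlechtes essen", "gutes essen", "schlechtes hotel"]
train_labels = ["pos", "neg", "pos", "neg"]

model_dir = tempfile.mkdtemp()  # stand-in for sentiment_MNB_path
vectorizer_path = os.path.join(model_dir, "vectorizer.pickle")  # hypothetical name
model_path = os.path.join(model_dir, "model.pickle")            # hypothetical name

# --- Training run: fit both objects once and save each to its own file ---
bow_transformer = CountVectorizer().fit(train_reviews)
nb = MultinomialNB().fit(bow_transformer.transform(train_reviews), train_labels)
with open(vectorizer_path, "wb") as f:
    pickle.dump(bow_transformer, f)
with open(model_path, "wb") as f:
    pickle.dump(nb, f)

# --- Prediction run: load both objects, do NOT call fit again ---
with open(vectorizer_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_nb = pickle.load(f)

new_reviews = ["sehr gutes zimmer", "schlechtes wlan"]
# transform() maps the new text onto the vocabulary learned at training time.
predictions = loaded_nb.predict(loaded_vectorizer.transform(new_reviews))
print(list(predictions))
```

One caveat when applying this to the script in the question: pickle stores functions by reference, so a vectorizer created with analyzer=text_process can only be loaded in a program where text_process is defined under the same module name.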
See this answer for more details: