Sklearn NotFittedError for CountVectorizer in pipeline
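I am trying to learn how to work with text data through sklearn and am running into an issue that I can't solve.
The tutorial I'm following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The input is a pandas df with two columns: one with text, one with a binary class.
Code: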
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = traindf['text']
y_train = traindf['class']
y_test = testdf['class']

# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)

# TF-IDF
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)

# MNB
mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
predicted = text_clf.predict(x_test_modified)
When I try to run the last line:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
304 for name, transform in self.steps[:-1]:
305 if transform is not None:
--> 306 Xt = transform.transform(Xt)
307 return self.steps[-1][-1].predict(Xt)
308
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
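Any suggestions on how to fix this error? I am transforming the CV model correctly on the testing data. I even checked whether the vocabulary list was empty, and it isn't (count_vect.vocabulary_).
Thank you!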
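There are several issues with your question.
For starters, you never actually fit the pipeline, hence the error. Looking more closely at the linked tutorial, you'll see that there is a step text_clf.fit (where text_clf is indeed the pipeline). The count_vect you fitted separately is a different object from the CountVectorizer() created inside the pipeline, which is why count_vect.vocabulary_ is populated while the pipeline's own vectorizer is still unfitted.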
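To make the first point concrete, here is a minimal sketch of the failure and the fix. The texts and labels are toy data invented for illustration (they are not from the question); the pipeline steps mirror the ones in the question.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for this example only.
texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Calling predict before fit fails: the vectorizer inside the
# pipeline has no vocabulary yet.
try:
    text_clf.predict(texts)
    raised = False
except NotFittedError:
    raised = True

# Fitting the pipeline fits every step, after which predict works.
text_clf.fit(texts, labels)
predicted = text_clf.predict(["good film"])
```

Fitting a standalone CountVectorizer beforehand would not change the outcome of the first predict call, since the pipeline holds its own, separate instance.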
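Second, you are not using the pipeline correctly: its whole point is to fit everything end-to-end, whereas you fit its components one by one. If you look at the tutorial again, you'll see that the code that fits the pipeline:
text_clf.fit(twenty_train.data, twenty_train.target)
uses the data in their initial form, not their intermediate transformations as you do. The point of the tutorial is to demonstrate how the individual transformations can be wrapped in (and replaced by) a pipeline, not to use the pipeline on top of those transformations. For the same reason, predict must receive the raw test text rather than x_test_modified, because the pipeline's own CountVectorizer performs the vectorization.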
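As a rough illustration of "wrapped in (and replaced by)", the sketch below runs the three steps by hand and then through a pipeline on the same raw text, and checks that the predictions agree. The toy texts and the names train_texts, test_texts, manual_pred and pipe_pred are assumptions made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for this example only.
train_texts = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great movie", "terrible movie"]

# Manual route: fit each component in turn on the training data,
# then apply the fitted transformers to the test data.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
train_counts = count_vect.fit_transform(train_texts)
train_tfidf = tfidf.fit_transform(train_counts)
clf.fit(train_tfidf, train_labels)
manual_pred = clf.predict(tfidf.transform(count_vect.transform(test_texts)))

# Pipeline route: the same steps, fed the raw text directly.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

same = np.array_equal(manual_pred, pipe_pred)
```

Both routes perform the same computation; the pipeline just takes over the bookkeeping of which transformer to call, and when, on which data.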
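Third, you should avoid naming a variable fit. It is not a reserved keyword in Python, but it is the name of the method that every scikit-learn estimator exposes, so using it as a variable name is confusing. Similarly, we don't use CV to abbreviate Count Vectorizer (in ML lingo, CV stands for cross-validation).
That said, here is the correct way to use your pipeline: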
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']  # testdf here; the question's code used traindf by mistake
y_train = traindf['class']
y_test = testdf['class']

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

# Fit the whole pipeline on the raw training text, then predict on raw test text
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
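As you can see, the purpose of pipelines is to make things simpler (compared with using the components one after another), not to complicate them further...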
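Once the pipeline is fitted, the held-out labels can be used to check the predictions. The following is only a sketch: it uses plain Python lists with invented texts as a stand-in for the text and class columns of nlp_df, and accuracy_score as one possible metric.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for nlp_df['text'] and nlp_df['class'].
texts = [
    "excellent plot and superb acting", "wonderful superb film",
    "excellent wonderful soundtrack", "superb excellent direction",
    "awful plot and dreadful acting", "boring dreadful film",
    "awful boring soundtrack", "dreadful awful direction",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratified split, so both classes appear in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Raw text goes in at both fit and predict time.
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)

# Fraction of test documents whose class was predicted correctly.
accuracy = accuracy_score(y_test, predicted)
print(f"accuracy: {accuracy:.2f}")
```

With a real DataFrame, the same calls should work on the traindf/testdf columns from the answer above.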