sklearn model data transform error: CountVectorizer - Vocabulary wasn't fitted
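I have already trained a topic classification model. When I try to transform new data into vectors for prediction, it fails with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted." However, prediction works when I split the training data into a test set inside the training script. Here is the code: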
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
# read new dataset
testdf = pd.read_csv('C://Users/KW198/Documents/topic_model/training_data/testdata.csv', encoding='cp950')
testdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 2 columns):
keywords 1800 non-null object
topics 1800 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.2+ KB
# read columns
kw = testdf['keywords']
label = testdf['topics']
# convert the prediction data into vectors
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_testkw_vec = vectorizer.transform(kw)
This is where the error occurs:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-93-cfcc7201e0f8> in <module>()
1 # convert the prediction data into vectors
2 vectorizer = CountVectorizer(min_df=1, stop_words='english')
----> 3 x_testkw_vec = vectorizer.transform(kw)
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
You need to call vectorizer.fit() so that the count vectorizer builds its dictionary of words before you call vectorizer.transform(). You can also just call vectorizer.fit_transform(), which combines both steps.
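As a minimal sketch of the fit-then-transform order described above: the documents and the names train_docs and new_docs are made up for illustration and are not from the question. The vocabulary is learned from the training text only, and the new text is then mapped onto those same columns.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training keywords and the new keywords.
train_docs = [
    "stock market prices rise",
    "team wins football match",
    "market crash hits stock prices",
]
new_docs = ["football team loses match", "stock prices fall"]

# Calling transform() on a vectorizer that was never fitted reproduces the error.
unfitted = CountVectorizer()
try:
    unfitted.transform(new_docs)
    raised = False
except NotFittedError:
    raised = True

# fit_transform() learns the vocabulary from the training text and vectorizes it.
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_docs)

# transform() reuses that vocabulary; words unseen in training are ignored.
x_new = vectorizer.transform(new_docs)

print(raised, x_train.shape, x_new.shape)
```

Both matrices have the same number of columns, one per word in the vocabulary learned from train_docs.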
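However, you should not use a new vectorizer for testing or any kind of inference. You need to use the same one you used when training the model; otherwise your results will be effectively random, because the vocabularies differ (some words are missing, the column indices do not line up, and so on).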
To do that, you can pickle the vectorizer used during training and load it at inference/test time.
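A rough sketch of that workflow follows. The classifier (MultinomialNB), the sample documents, and the file name vectorizer.pkl are assumptions chosen for illustration; the question does not show which model or paths were used.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for the 'keywords' and 'topics' columns.
train_docs = [
    "stock market prices rise",
    "team wins football match",
    "market crash hits stock prices",
    "football season final match",
]
train_labels = [0, 1, 0, 1]
new_docs = ["football team loses match", "stock prices fall"]

# --- training time ---
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(x_train, train_labels)  # assumed model, for illustration

# Save the fitted vectorizer (hypothetical file name in a temporary directory).
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# --- inference time ---
# Load the fitted vectorizer instead of constructing a new CountVectorizer.
with open(path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

# Only transform() is called here, so the columns match what the model was trained on.
x_new = loaded_vectorizer.transform(new_docs)
pred = clf.predict(x_new)
print(pred)
```

The trained model can be saved and loaded the same way. Pickle files should only be loaded from a source you trust, and ideally with the same scikit-learn version that wrote them.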