Sklearn Pipeline and the original model don't give the same answer ("fixed output")
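I am developing a simple text classifier for SMS messages. The full model has 3 steps:
- TextCleaning() (custom function)
- TfidfVectorizer() (vectorizer)
- MultinomialNB() (classification model)
I want to use sklearn.pipeline to merge the 3 steps into a single model and save it with joblib.dump. The problem is that when I load the saved model, the output is fixed: for any test or training data from the spam class, I always get ham!
This is the custom function before the Pipeline: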
    import re

    # `nlp` is assumed to be a spaCy language pipeline loaded earlier in the notebook.
    def TextCleaning(X):
        documents = []
        for sent in X:
            # Remove single letters surrounded by whitespace
            sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
            # Collapse runs of whitespace into a single space
            sent = re.sub(r'\s+', ' ', sent, flags=re.I)
            # Lemmatize each token and rejoin the lemmas into one string
            doc = nlp(sent)
            document = [token.lemma_ for token in doc]
            document = ' '.join(document)
            documents.append(document)
        return documents
This is TextCleaning rewritten as a class for use in the Pipeline:
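    import re

    # `nlp` is assumed to be a spaCy language pipeline loaded earlier in the notebook.
    class TextCleaning():
        def __init__(self):
            print("call init")

        def fit(self, X, y=None):
            # Nothing to learn; the transformer is stateless
            return self

        def transform(self, X, y=None):
            documents = []
            # X must be an iterable of strings, one per message
            for sent in X:
                # Remove single letters surrounded by whitespace
                sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
                # Collapse runs of whitespace into a single space
                sent = re.sub(r'\s+', ' ', sent, flags=re.I)
                # Lemmatize each token and rejoin the lemmas into one string
                doc = nlp(sent)
                document = [token.lemma_ for token in doc]
                document = ' '.join(document)
                documents.append(document)
            return documents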
This is the Pipeline code:
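    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    EmailClassification = Pipeline([('TextCleaning', TextCleaning()),
                                    ('Vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
                                    ('NB', MultinomialNB())])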
The full notebook and data are available online.
In your notebook, you are calling:

    EmailClassification.predict("Congratulations, you won @ free rolex")

If you pass the data as a bare string, the TextCleaning step loops over it character by character, so the pipeline treats the string as a list of one-character documents and predicts a label for each one. That is why you get as many predictions as the string has characters:

    EmailClassification.predict("Congratulations, you won @ free rolex")
    array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
           'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
           'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
           'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
           'ham'], dtype='<U4')
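The per-character behaviour comes from how Python iterates over strings, not from anything specific to the saved model. The sketch below reproduces only the `for sent in X` loop from TextCleaning.transform; the helper name `iterate_documents` is hypothetical and exists purely for illustration.

```python
def iterate_documents(X):
    """Mimic the `for sent in X` loop in TextCleaning.transform (illustrative helper)."""
    return [sent for sent in X]


text = "Congratulations, you won @ free rolex"

# Bare string: iteration yields one item per character.
as_string = iterate_documents(text)
# List holding the string: iteration yields one item per message.
as_list = iterate_documents([text])

print(len(as_string))  # 37, the same as len(text)
print(len(as_list))    # 1
```

The 37 items match the 37 'ham' predictions in the output above, one per character.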
The call should instead pass a list containing the string:

    EmailClassification.predict(["Congratulations, you won @ free rolex"])
    array(['spam'], dtype='<U4')
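If you would rather have the cleaning step tolerate a bare string, one option is to normalize the input before looping over it. This is only a sketch under assumptions: `as_documents` is a hypothetical helper that is not part of scikit-learn or the original notebook, and it helps here only because TextCleaning is the first step of the pipeline.

```python
def as_documents(X):
    """Return X as a list of documents, wrapping a bare string in a list.

    Hypothetical helper: a single string is treated as one message
    instead of being iterated character by character.
    """
    if isinstance(X, str):
        return [X]
    return list(X)


print(as_documents("Congratulations, you won @ free rolex"))
print(as_documents(["first message", "second message"]))
```

Inside TextCleaning.transform, the loop would then read `for sent in as_documents(X):`. Passing a list explicitly, as shown above, remains the simpler fix.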