Sklearn Pipeline and the original model don't give the same answer ("fixed output")

Question

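I'm building a simple text classifier for SMS messages. The complete model has 3 steps:

TextCleaning() "custom function"
TfidfVectorizer() "vectorizer"
MultinomialNB() "classification model"

I want to use sklearn.pipeline to combine the 3 steps into one model and save it with joblib.dump. The problem is that when I load the saved model, the output is fixed: for any test or training example from the spam class, I get ham!

This is the custom function before the Pipeline: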

def TextCleaning(X):
    documents = []
    
    for sent in X:
        # Remove all single characters
        sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
        
        # Substituting multiple spaces with single space
        sent = re.sub(r'\s+', ' ', sent, flags=re.I)
        
        doc = nlp(sent)
        
        document = [token.lemma_ for token in doc]
        
        document = ' '.join(document)
        
        documents.append(document)
    return documents

This is TextCleaning rewritten as a class for the Pipeline:

class TextCleaning():
    def __init__(self):
        print("call init")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        documents = []
        for sent in X:
            # Remove all single characters
            sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)

            # Substituting multiple spaces with single space
            sent = re.sub(r'\s+', ' ', sent, flags=re.I)

            doc = nlp(sent)

            document = [token.lemma_ for token in doc]

            document = ' '.join(document)

            documents.append(document)
            
        return documents

This is the Pipeline code:

EmailClassification = Pipeline([('TextCleaning', TextCleaning()),
                                ('Vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
                                ('NB', MultinomialNB())])

The full notebook and data are on GitHub: Ham-or-Spam-SMS-Classification

Answer 1

In your notebook, you are doing:

EmailClassification.predict("Congratulations, you won @ free rolex")

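If you pass the data as a bare string, the pipeline treats it as a list of characters (the loop in TextCleaning.transform iterates over the string one character at a time) and tries to predict each character, so you get as many predictions as the string has characters: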

EmailClassification.predict("Congratulations, you won @ free rolex")
array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham'], dtype='<U4')

It should be a list containing the message:

EmailClassification.predict(["Congratulations, you won @ free rolex"])
array(['spam'], dtype='<U4')
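
To make the difference concrete, here is a minimal sketch that needs neither spaCy nor scikit-learn. `clean_documents` is a hypothetical stand-in for `TextCleaning.transform` that keeps only the two regex steps (the lemmatization is left out), and `as_documents` is a hypothetical guard, not something from the notebook. It only counts how many "documents" the cleaning step produces for each kind of input.

```python
import re

def clean_documents(X):
    """Hypothetical stand-in for TextCleaning.transform (regex steps only, no spaCy)."""
    documents = []
    for sent in X:  # if X is a str, each `sent` is a single character
        # Remove single letters surrounded by whitespace
        sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
        # Collapse runs of whitespace into one space
        sent = re.sub(r'\s+', ' ', sent)
        documents.append(sent)
    return documents

def as_documents(X):
    """Hypothetical guard: wrap a bare string so it is treated as one document."""
    return [X] if isinstance(X, str) else list(X)

message = "Congratulations, you won @ free rolex"

per_character = clean_documents(message)          # one "document" per character
one_document = clean_documents([message])         # a single document
guarded = clean_documents(as_documents(message))  # bare string wrapped first

print(len(per_character), len(one_document), len(guarded))  # 37 1 1
```

The 37 here matches the 37 'ham' predictions above. If you want the pipeline to tolerate a bare string, one option (an assumption about how you might use it, not a requirement) is to call `EmailClassification.predict(as_documents(text))`, or to apply the same check at the top of `transform`.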


Tags: python, pipeline, scikit-learn, joblib, tfidfvectorizer