"Number of features of the model must match the input" 在尝试预测新的看不见的数据时

"Number of features of the model must match the input" while trying to predict new unseen data

I trained a model on some Wikipedia articles divided into two categories (12 articles per category).

Here is how I create, train, and pickle the model:

import numpy as np
import re
import nltk
from sklearn.datasets import load_files
import pickle
from nltk.corpus import stopwords
data = load_files(r'[...]review_polarity')
X, y = data.data, data.target
documents = []
from nltk.stem import WordNetLemmatizer
stemmer = WordNetLemmatizer()
for sen in range(0, len(X)):  
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()

    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=0, max_df=1.0, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(documents).toarray()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

Then I loaded the pickle file and tried to predict the category of a new, unseen article:

import pickle
import sys, os
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

with open(os.path.join(sys.path[0], 'text_classifier'), 'rb') as training_model:
    model = pickle.load(training_model)

with open(os.path.join(sys.path[0], 'article.txt'), 'rb') as f:
    X = [f.read()]

documents = []
stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):  
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()

    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=0, max_df=1.0, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(documents).toarray()

y_pred = model.predict(X)
print(y_pred)

When I call the predict function, I get the following error:

Number of features of the model must match the input. Model n_features is 10 and input n_features is 47

It looks like the new article ends up as a numpy array with 47 features, while the model was trained on arrays with 10 features. I'm not sure I understand this correctly, so I'd be glad if you could help me understand it better and get it working.
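To illustrate the mismatch with a toy example (made-up sentences, not my actual articles): fitting a fresh TfidfVectorizer on a different corpus builds a different vocabulary, and therefore a different number of features:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "dogs are loyal animals"]
new_docs = ["a completely different article about something else entirely"]

train_vec = TfidfVectorizer().fit(train_docs)   # vocabulary learned from the training corpus
new_vec = TfidfVectorizer().fit(new_docs)       # refitting builds a brand new vocabulary

print(len(train_vec.vocabulary_))   # number of features the model would be trained with
print(len(new_vec.vocabulary_))     # a different number of features for the new text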

Thanks!

The answer is that I should call "transform" instead of "fit_transform" on the new unseen data: transform encodes the new article with the vocabulary already learned from the training data, so the number of features stays the same.
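For completeness, here is a minimal sketch of how that can look (toy data instead of my real articles, and the pickle file name 'text_classifier' is reused from above but now stores a (vectorizer, classifier) tuple): the TfidfVectorizer is fitted once on the training documents, pickled together with the classifier, and only its transform method is called on new data.

import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for the preprocessed articles
documents = ["the cat sat on the mat", "dogs are loyal animals",
             "python is a programming language", "java compiles to bytecode"]
y = [0, 0, 1, 1]

# --- training ---
tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform(documents).toarray()   # fit on the training corpus only

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X, y)

# Pickle the fitted vectorizer together with the classifier
with open('text_classifier', 'wb') as picklefile:
    pickle.dump((tfidfconverter, classifier), picklefile)

# --- prediction ---
with open('text_classifier', 'rb') as training_model:
    tfidfconverter, model = pickle.load(training_model)

new_documents = ["an unseen article about programming"]
# transform (not fit_transform) reuses the training vocabulary,
# so the feature count matches what the classifier was trained on
X_new = tfidfconverter.transform(new_documents).toarray()
print(model.predict(X_new))

An equivalent, slightly cleaner option is to wrap the vectorizer and the classifier in a sklearn.pipeline.Pipeline and pickle that single object, so raw text can be passed straight to predict.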