I got ValueError: X has 5851 features per sample; expecting 2754 when applying Linear SVC model to test set

I am trying to classify text with a linear SVC, but I get an error.

I applied the model to the test set as shown below. In this code, I build the TF-IDF features and oversample the training set.

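# Imports (the texthero imports are assumed from the `hero` and `preprocessing` names used below)
import pandas as pd
import texthero as hero
from texthero import preprocessing
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from imblearn.over_sampling import RandomOverSampler

#Import datasets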
train = pd.read_csv('train_labeled.csv')
test = pd.read_csv('test.csv')

#Clean datasets
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_urls,
                   preprocessing.remove_digits,
                   preprocessing.stem  
                   ]



train["clean_text"] = train["text"].pipe(hero.clean, custom_pipeline)
test["clean_text"] = test["text"].pipe(hero.clean, custom_pipeline)

#Create Tfidf

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train["clean_text"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.fit_transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)

#Oversampling of training set
over = RandomOverSampler(sampling_strategy='minority')

X_os, y_os = over.fit_resample(X_train_tfidf, train["label"])

#Model
clf = svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True, tol=1e-3)
clf.fit(X_os, y_os)

pred = clf.predict(X_test_tfidf)

I got the error below. I think it is because the test set has 5851 samples while the training set has 2754 samples.

ValueError: X has 5851 features per sample; expecting 2754

What should I do in this situation?

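Don't call fit_transform() on the test data: the transformer would learn a new vocabulary instead of transforming the test data the same way it transformed the training data. The numbers in the error message are feature counts (vocabulary sizes), not sample counts: the model was trained on 2754 features, and the refitted vectorizer produced 5851. To use the same vocabulary as the training data, call only transform() on the test data: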

# initialize transformers
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit and transform train data
X_train_counts = count_vect.fit_transform(train["clean_text"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# transform test data
X_test_counts = count_vect.transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
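
To see why the column count changes, here is a minimal sketch using a toy vectorizer written in plain Python. ToyCountVectorizer and the sample sentences are hypothetical and only imitate the fit/transform idea; this is not how scikit-learn implements CountVectorizer.

```python
# Toy stand-in for a count vectorizer (hypothetical; not scikit-learn's code).
class ToyCountVectorizer:
    def fit(self, docs):
        # Learn the vocabulary: one column index per distinct word.
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the already-learned vocabulary.
        # Words that were not seen during fit() are ignored.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.split():
                idx = self.vocabulary_.get(w)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train_docs = ["good movie", "bad movie"]
test_docs = ["good plot and good acting"]

vect = ToyCountVectorizer()
X_train = vect.fit_transform(train_docs)   # columns: bad, good, movie
X_test_ok = vect.transform(test_docs)      # same 3 columns as the training data

# Refitting on the test data learns a different vocabulary:
# columns become acting, and, good, plot.
X_test_bad = ToyCountVectorizer().fit_transform(test_docs)

print(len(X_train[0]), len(X_test_ok[0]), len(X_test_bad[0]))  # prints: 3 3 4
```

A model fitted on X_train expects 3 columns, so X_test_bad with its 4 columns would trigger the same kind of mismatch as in the question.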

Note

If you don't need the output of CountVectorizer, you can use TfidfVectorizer to reduce the amount of code:

tfidf_vect = TfidfVectorizer()

X_train_tfidf = tfidf_vect.fit_transform(train["clean_text"])
X_test_tfidf = tfidf_vect.transform(test["clean_text"])
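
One way to make this mistake harder to repeat is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that predict() only ever calls transform() on new text. The following is a rough sketch under assumptions: the toy texts and labels are made up and stand in for train["clean_text"], train["label"] and test["clean_text"], and the oversampling step is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/test columns.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film with a great plot", "awful plot"]

# fit() fits the vectorizer and the classifier on the training texts only.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# predict() reuses the training vocabulary via transform(), so the
# feature count always matches what the classifier was trained on.
pred = model.predict(test_texts)
print(pred.shape)
```

A plain scikit-learn pipeline does not accept resamplers such as RandomOverSampler; imbalanced-learn ships its own pipeline in imblearn.pipeline for that case, which may be worth a look if the oversampling step should live inside the pipeline too.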