I got ValueError: X has 5851 features per sample; expecting 2754 when applying Linear SVC model to test set
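I'm trying to classify text with a linear SVC, but I get an error.
I applied the model to the test set as shown below. In this code I build the TF-IDF features and oversample the training set.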
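import pandas as pd
import texthero as hero
from texthero import preprocessing
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from imblearn.over_sampling import RandomOverSampler
#Import datasets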
train = pd.read_csv('train_labeled.csv')
test = pd.read_csv('test.csv')
#Clean datasets
custom_pipeline = [preprocessing.fillna,
preprocessing.lowercase,
preprocessing.remove_whitespace,
preprocessing.remove_punctuation,
preprocessing.remove_urls,
preprocessing.remove_digits,
preprocessing.stem
]
train["clean_text"] = train["text"].pipe(hero.clean, custom_pipeline)
test["clean_text"] = test["text"].pipe(hero.clean, custom_pipeline)
#Create Tfidf
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train["clean_text"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_counts = count_vect.fit_transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
#Oversampling of training set
over = RandomOverSampler(sampling_strategy='minority')
X_os, y_os = over.fit_resample(X_train_tfidf, train["label"])
#Model
clf = svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True, tol=1e-3)
clf.fit(X_os, y_os)
pred = clf.predict(X_test_tfidf)
I got the error below. I think it is because the test set has 5851 samples while the training set has 2754 samples.
ValueError: X has 5851 features per sample; expecting 2754
What should I do in this case?
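Don't call fit_transform() on the test data: the transformers would learn a new vocabulary from it instead of transforming the test data the same way the training data was transformed. Note that the 5851 and 2754 in the error message are feature counts (vocabulary sizes), not sample counts. To use the same vocabulary as the training data, call only transform() on the test data: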
# initialize transformers
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit and transform train data
X_train_counts = count_vect.fit_transform(train["clean_text"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# transform test data
X_test_counts = count_vect.transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
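To see why the feature counts differ, here is a minimal sketch on a toy corpus. The documents and labels are made up for illustration and are not from the question's data; it only assumes scikit-learn is installed. It compares the column count you get from transform() against a fresh fit_transform() on the test data, and shows that the second one triggers the same kind of ValueError.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for this illustration only
train_docs = ["red apple", "green apple", "red car"]
train_labels = [0, 0, 1]
test_docs = ["blue car", "green bike on the road"]

# Fit the vocabulary on the training data only
vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)
clf = LinearSVC().fit(X_train, train_labels)

# Correct: reuse the training vocabulary, so the columns line up
X_test_ok = vect.transform(test_docs)

# Incorrect: a new fit on the test data learns a different vocabulary
X_test_bad = CountVectorizer().fit_transform(test_docs)

print(X_train.shape, X_test_ok.shape, X_test_bad.shape)  # (3, 4) (2, 4) (2, 7)

# Prediction works when the feature counts match
pred = clf.predict(X_test_ok)

# With mismatched feature counts, predict() raises ValueError
try:
    clf.predict(X_test_bad)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Words that appear only in the test data (here "blue", "bike", and so on) are simply ignored by transform(), which is the expected behavior: the model has no weights for them.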
Note
If you don't need the output of CountVectorizer, you can use TfidfVectorizer to reduce the amount of code:
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train["clean_text"])
X_test_tfidf = tfidf_vect.transform(test["clean_text"])
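As a quick sanity check, the sketch below compares the two approaches on a made-up toy corpus (not the question's data), assuming default parameters for all three classes. With defaults, TfidfVectorizer should give the same matrices as CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy data, invented for this illustration only
train_docs = ["red apple", "green apple", "red car"]
test_docs = ["blue car", "green bike on the road"]

# Two-step version: counts first, then TF-IDF weighting
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_two = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))
X_test_two = tfidf_transformer.transform(count_vect.transform(test_docs))

# One-step version: TfidfVectorizer does both
tfidf_vect = TfidfVectorizer()
X_train_one = tfidf_vect.fit_transform(train_docs)
X_test_one = tfidf_vect.transform(test_docs)

# Compare the dense forms of the sparse results
same_train = np.allclose(X_train_two.toarray(), X_train_one.toarray())
same_test = np.allclose(X_test_two.toarray(), X_test_one.toarray())
print(same_train, same_test)
```

In both versions the rule is the same: fit_transform() on the training data, transform() on the test data.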