Scikit-learn pipeline: Non-finite test scores error / Inconsistent number of samples
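I have a data frame with two columns, the text and the POS tags only (of the same text), that I want to use for language classification. I am trying to use both features as part of the model. This is what the data looks like: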
X_train.head()
This is the shape of the data:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
X_train.shape[0] != y_train.shape[0]
(11000, 2)
(11000,)
(1100, 2)
(1100,)
False
When I run my estimator on either column of the training set separately, it works fine. But as soon as I include both columns and run my estimator:
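from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

scaler = MaxAbsScaler()
count_vect = CountVectorizer(lowercase=False, max_features=1000)
clf = SVC()
# The vectorizer receives the whole two-column frame here
pipe = make_pipeline(count_vect, scaler, clf)
params = [{
    'countvectorizer__analyzer': ['word', 'char'],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'svc__kernel': ['linear', 'rbf', 'poly']
}]
gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)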
I get this error:
UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan]
ValueError: Found input variables with inconsistent numbers of samples: [2, 11000]
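I have tried changing the types from Series to string and running the .transpose() function, but neither worked. I don't understand what is causing the NaNs. Can you help?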
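I think the problem is that CountVectorizer expects one-dimensional input. You can work around this by using a ColumnTransformer that contains two copies of the vectorizer, one per column.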
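To see where the [2, 11000] in the error could come from, here is a minimal sketch on a made-up frame. X_toy and its contents are assumptions for illustration, not the asker's data. Iterating over a DataFrame yields its column labels, so the vectorizer appears to treat a two-column frame as two documents:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy frame standing in for X_train (contents are made up).
X_toy = pd.DataFrame({
    "text": ["the cat sat", "le chat dort", "der Hund bellt"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB"],
})

# Iterating over a DataFrame yields the column labels, not the rows.
docs_seen = list(X_toy)
print(docs_seen)

# CountVectorizer iterates over its input the same way, so it tokenizes
# the two labels and returns one row per column instead of one per sample.
counts = CountVectorizer(lowercase=False).fit_transform(X_toy)
print(counts.shape[0], "rows for", len(X_toy), "samples")
```

With the real data this would give 2 feature rows against 11000 labels, which matches the sample counts reported in the ValueError.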
For example, assuming X_train is a data frame with columns text and pos, the pipeline becomes:
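from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

scaler = MaxAbsScaler()
count_vect = CountVectorizer(lowercase=False, max_features=1000)
# A string column selector hands each vectorizer a 1-D column of documents
vectorizer = ColumnTransformer([
    ('vec_txt', count_vect, 'text'),
    ('vec_pos', count_vect, 'pos'),
])
clf = SVC()
pipe = make_pipeline(vectorizer, scaler, clf)
# Each vectorizer is tuned separately through its own name prefix
params = {
    'columntransformer__vec_txt__analyzer': ['word', 'char'],
    'columntransformer__vec_txt__ngram_range': [(1, 1), (1, 2)],
    'columntransformer__vec_pos__analyzer': ['word', 'char'],
    'columntransformer__vec_pos__ngram_range': [(1, 1), (1, 2)],
    'svc__kernel': ['linear', 'rbf', 'poly'],
}
gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1)
gs.fit(X_train, y_train)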
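As for the NaNs: GridSearchCV's error_score defaults to np.nan, so a candidate whose fit raises is recorded with a NaN score instead of stopping the search, and the real exception only shows up at the refit step. The sketch below runs the same kind of pipeline on a small made-up data set with error_score="raise", which surfaces a failing fit immediately. X_toy, y_toy, the sentences, and the reduced parameter grid are all assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Hypothetical toy data: 6 English and 6 French sentences with POS tags.
english = ["the cat sat", "the dog ran", "a bird sang",
           "the fish swam", "a cow ate", "the pig slept"]
french = ["le chat dort", "le chien court", "un oiseau chante",
          "le poisson nage", "une vache mange", "le cochon dort"]
X_toy = pd.DataFrame({
    "text": english + french,
    "pos": ["DET NOUN VERB"] * 12,
})
y_toy = ["en"] * 6 + ["fr"] * 6

# One vectorizer per column; the string selector passes a 1-D column.
vectorizer = ColumnTransformer([
    ("vec_txt", CountVectorizer(lowercase=False), "text"),
    ("vec_pos", CountVectorizer(lowercase=False), "pos"),
])

# The combined feature matrix has one row per sample.
features = vectorizer.fit_transform(X_toy)
print(features.shape[0])

pipe = make_pipeline(vectorizer, MaxAbsScaler(), SVC())
# error_score="raise" re-raises a failing fit instead of scoring it as NaN.
gs = GridSearchCV(pipe, {"svc__kernel": ["linear", "rbf"]},
                  cv=2, scoring="accuracy", error_score="raise")
gs.fit(X_toy, y_toy)
print(gs.best_params_)
```

Running the original single-vectorizer pipeline with error_score="raise" should report the inconsistent-samples ValueError from the first fold rather than a list of NaN scores.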