Value error for CatboostRegressor with StratifiedKFold
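I have just started learning CatBoost and tried to use CatBoostRegressor with StratifiedKFold, but I am running into an error.
This is an edited post that includes the complete code block and the error for clarification. I also tried
for i, (train_index, test_index) in enumerate(fold.split(X,y)):
but that did not work either.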
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import LabelEncoder
from catboost import Pool, CatBoostRegressor

fold=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
err = []
y_pred = []
for train_index, test_index in fold.split(X,y):
    #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y[train_index], y[test_index]
    _train = Pool(X_train, label = y_train)
    _valid = Pool(X_val, label = y_val)
    cb = CatBoostRegressor(n_estimators = 20000,
                           reg_lambda = 1.0,
                           eval_metric = 'RMSE',
                           random_seed = 42,
                           learning_rate = 0.01,
                           od_type = "Iter",
                           early_stopping_rounds = 2000,
                           depth = 7,
                           cat_features = cate,
                           bagging_temperature = 1.0)
    cb.fit(_train,cat_features=cate,eval_set = _valid, early_stopping_rounds = 2000, use_best_model = True, verbose_eval = 100)
    p = cb.predict(X_val)
    print("err: ",rmsle(y_val,p))
    err.append(rmsle(y_val,p))
    pred = cb.predict(test_df)
    y_pred.append(pred)
predictions = np.mean(y_pred,0)
ValueError                                Traceback (most recent call last)
<ipython-input-21-3a0df0c7b8d6> in <module>()
      7 err = []
      8 y_pred = []
----> 9 for train_index, test_index in fold.split(X,y):
     10     #for i, (train_index, test_index) in enumerate(fold.split(X,y)):
     11     X_train, X_val = X.iloc[train_index], X.iloc[test_index]

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    333                 .format(self.n_splits, n_samples))
    334
--> 335         for train, test in super().split(X, y, groups):
    336             yield train, test
    337

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     87         X, y, groups = indexable(X, y, groups)
     88         indices = np.arange(_num_samples(X))
---> 89         for test_index in self._iter_test_masks(X, y, groups):
     90             train_index = indices[np.logical_not(test_index)]
     91             test_index = indices[test_index]

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
    684
    685     def _iter_test_masks(self, X, y=None, groups=None):
--> 686         test_folds = self._make_test_folds(X, y)
    687         for i in range(self.n_splits):
    688             yield test_folds == i

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y)
    639             raise ValueError(
    640                 'Supported target types are: {}. Got {!r} instead.'.format(
--> 641                     allowed_target_types, type_of_target_y))
    642
    643         y = column_or_1d(y)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
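You are getting this error for a very fundamental reason rooted in basic ML theory: stratification is defined only for classification, where it ensures that every class is represented in the same proportion in each split; it is meaningless in regression. If you read the error message closely, you should be able to convince yourself that it says 'continuous' targets (i.e. regression) are not supported, only 'binary' or 'multiclass' ones (i.e. classification). This is not some peculiarity of scikit-learn but a fundamental issue.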
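To see how scikit-learn arrives at the labels quoted in the error message, the short sketch below calls `type_of_target` from `sklearn.utils.multiclass` on a few toy targets. The sample values are made up for illustration; only the function itself comes from scikit-learn.

```python
from sklearn.utils.multiclass import type_of_target

# Floats with a fractional part are treated as a regression target.
continuous_kind = type_of_target([0.1, 0.5, -1.1, 1.2])
# Two distinct integer labels form a binary classification target.
binary_kind = type_of_target([0, 1, 1, 0])
# Three or more distinct integer labels form a multiclass target.
multiclass_kind = type_of_target([0, 1, 2, 1])

print(continuous_kind, binary_kind, multiclass_kind)
```

Running the same check on your own `y` before choosing a splitter is a quick way to confirm which kind of problem scikit-learn thinks you have.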
A relevant hint is also included in the documentation:
Stratified K-Folds cross-validator
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns
stratified folds. The folds are made by preserving the percentage of
samples for each class.
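As a rough illustration of what "preserving the percentage of samples for each class" means, the sketch below runs StratifiedKFold on a toy classification target with a 2:1 class ratio and counts the classes in each test fold. The data and the variable name `fold_class_counts` are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy classification data: 8 samples of class 0 and 4 of class 1 (a 2:1 ratio).
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
fold_class_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class land in this test fold.
    counts = np.bincount(y[test_index], minlength=2)
    fold_class_counts.append(counts.tolist())

print(fold_class_counts)
```

Each test fold should contain two samples of class 0 and one of class 1, i.e. the same 2:1 ratio as the full data set. With a continuous target there are no classes to count, so no such ratio can be preserved.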
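Here is a short demonstration, adapted from the example in the documentation but with the target y changed to continuous (regression) instead of discrete (classification):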
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0.1, 0.5, -1.1, 1.2]) # continuous targets, i.e. regression problem
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X,y):
    print("something")
[...]
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
So, simply put, you cannot actually use StratifiedKFold in your (regression) setting; change it to a plain KFold and proceed from there...
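For comparison, here is a minimal sketch of the same toy data split with a plain KFold, which does not inspect the target type and therefore accepts continuous values. The `shuffle` and `random_state` settings mirror the ones used in the question, and the `splits` list is only there to collect the indices for inspection.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0.1, 0.5, -1.1, 1.2])  # continuous targets, i.e. regression problem

kf = KFold(n_splits=2, shuffle=True, random_state=42)
splits = []
for train_index, test_index in kf.split(X, y):
    # Each sample appears in exactly one test fold across the loop.
    splits.append((train_index.tolist(), test_index.tolist()))
    print("train:", train_index, "test:", test_index)
```

In the loop from the question, swapping `StratifiedKFold(...)` for `KFold(...)` with the same arguments should be the only change needed to get past this particular ValueError.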