XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

I am training an XGBoostClassifier on my training set.

My training features are of shape (45001, 10338), a numpy array, and my training labels are of shape (45001,) [I have 1161 unique labels, so I label encoded the labels]; this is also a numpy array.
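
For reference, the label encoding was done along these lines (a rough sketch with toy values; the exact encoder is not important, only that the result is an integer array of shape (45001,)):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy illustration: map raw label values to integers 0..n_classes-1.
# On the real data this gives train_y of shape (45001,) with 1161 distinct values.
raw_labels = np.array(['cat_a', 'cat_b', 'cat_a', 'cat_c'])
train_y = LabelEncoder().fit_transform(raw_labels)   # e.g. array([0, 1, 0, 2])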

The documentation clearly states that I can create a DMatrix from numpy arrays, so I used the training features and labels mentioned above directly as numpy arrays. However, I get the following error:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-30-3de36245534e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x, train_y)

<ipython-input-27-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in <listcomp>(.0)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in eval(self, iteration, feval)
    221     def eval(self, iteration, feval):
    222         """"Evaluate the CVPack for one iteration."""
--> 223         return self.bst.eval_set(self.watchlist, iteration, feval)
    224 
    225 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
    865             _check_call(_LIB.XGBoosterEvalOneIter(self.handle, iteration,
    866                                                   dmats, evnames, len(evals),
--> 867                                                   ctypes.byref(msg)))
    868             return msg.value
    869         else:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

Please find my model code below:

def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 1161   
        xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc',early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels, eval_metric='auc')

    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(train_labels, dtrain_predictions))

Where am I going wrong above?

My classifier is as follows:

xgb1 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=50,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective='multi:softmax',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

EDIT - 2: After changing the evaluation metric,

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-9-30c62a886c2e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x_trail, train_y_trail)

<ipython-input-8-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    398                            evaluation_result_list=None))
    399         for fold in cvfolds:
--> 400             fold.update(i, obj)
    401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in update(self, iteration, fobj)
    217     def update(self, iteration, fobj):
    218         """"Update the boosters for one iteration"""
--> 219         self.bst.update(self.dtrain, iteration, fobj)
    220 
    221     def eval(self, iteration, feval):

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
    804 
    805         if fobj is None:
--> 806             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
    807         else:
    808             pred = self.predict(dtrain)

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[03:43:03] src/objective/multiclass_obj.cc:42: Check failed: (info.labels.size()) != (0) label set cannot be empty'

The error is because you are trying to use the AUC evaluation metric for multiclass classification, but AUC is only applicable to two-class problems. In the xgboost implementation, "auc" expects the prediction size to be the same as the label size, while your multiclass prediction size would be 45001 * 1161. Use the "mlogloss" or "merror" multiclass metrics instead.
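
For example, the same xgb.cv call pattern from your question runs fine with a multiclass metric. Here is a minimal runnable sketch on toy data (the parameter values are illustrative, not a recommendation):

import numpy as np
import xgboost as xgb

# Toy data standing in for the real (45001, 10338) features and 1161 classes.
X = np.random.normal(size=(200, 30))
y = np.random.randint(0, 5, 200)

params = {'objective': 'multi:softmax', 'num_class': 5,
          'max_depth': 5, 'eta': 0.1,
          'subsample': 0.8, 'colsample_bytree': 0.8}
dtrain = xgb.DMatrix(X, label=y)

# 'mlogloss' (or 'merror') is valid for multiclass, unlike 'auc'.
cvresult = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                  metrics='mlogloss', early_stopping_rounds=50)
print(cvresult)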

P.S.: Currently xgboost will be rather slow with this many classes, because of inefficient prediction caching during training.

The original error you got is because this metric was not designed for multiclass classification (see here).

You could work around the problem by using xgboost's scikit-learn wrapper. I modified your code with this wrapper to produce similar functionality. I'm not sure why you are doing a grid search, though, as you are not enumerating over any parameters; instead you are just using the parameters you specified in xgb1. Here is the modified code:

import xgboost as xgb
import sklearn.metrics
import numpy as np
from sklearn.model_selection import GridSearchCV

def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5):

    if useTrainCV:
        # Wrap each parameter value in a one-element list so the existing
        # parameters can be passed to GridSearchCV as a (degenerate) grid.
        params = alg.get_xgb_params()
        xgb_param = dict([(key, [params[key]]) for key in params])

        boost = xgb.sklearn.XGBClassifier()
        cvresult = GridSearchCV(boost, xgb_param, cv=cv_folds)
        cvresult.fit(train_data_features, train_labels)
        alg = cvresult.best_estimator_


    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels)

    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)  # per-class probabilities; a [:,1] slice only makes sense for binary problems

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % sklearn.metrics.accuracy_score(train_labels, dtrain_predictions))

xgb1 = xgb.sklearn.XGBClassifier(
 learning_rate =0.1,
 n_estimators=50,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective='multi:softmax',
 nthread=4,
 scale_pos_weight=1,
 seed=27)    


X=np.random.normal(size=(200,30))
y=np.random.randint(0,5,200)

modelfit(xgb1, X, y)

The output I get is

Model Report
Accuracy : 1

Note that I used much smaller data. With the size you mention, the algorithm is likely to be quite slow.