XGBoost: The least populated class in y has only 1 members, which is too few

Question

I am using the scikit-learn wrapper for XGBoost in a Kaggle competition. However, I get this 'warning' message:

$ python Script1.py
/home/sky/private/virtualenv15.0.1dev/myVE/local/lib/python2.7/site-packages/sklearn/cross_validation.py:516:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
  % (min_labels, self.n_folds)), Warning)

According to another question on Stack Overflow: "Check that you have at least 3 samples per class to be able to do StratifiedKFold cross validation with k == 3 (I think this is the default CV used by GridSearchCV for classification)."

Well, I do not have at least 3 samples per class.

So my questions are:

a) What are the alternatives?

b) Why can't I use cross-validation?

c) What can I use instead?

...
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=3000,
        max_depth=15,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax',
        nthread=42,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=42,
    iid=False,
    cv=None,
    verbose=1)
...

grid_search.fit(train_x, place_id)

Answer 1

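If a target class has only one sample, that is too few for any model. What you can do is get another dataset, preferably as balanced as possible, since most models perform better on balanced sets.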
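To see how many classes are affected before deciding what to do, you could count the samples per class. The following is a minimal sketch using only the standard library; `rare_classes` and `toy_labels` are hypothetical names made up for illustration, and the toy labels merely stand in for `place_id` from the question.

```python
from collections import Counter


def rare_classes(labels, n_folds=3):
    """Return {label: count} for classes with fewer than n_folds samples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < n_folds}


# Toy labels standing in for place_id (illustrative data, not from the question).
toy_labels = ["a", "a", "a", "b", "b", "c"]

# Classes "b" and "c" have fewer than 3 samples, so stratified 3-fold
# cross-validation cannot put one of each into every fold.
too_small = rare_classes(toy_labels, n_folds=3)
print(too_small)
```

If the returned dictionary is empty, every class has at least `n_folds` samples and the warning should not be triggered for that reason.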

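If you cannot get another dataset, you will have to work with what you have. I suggest removing the samples that belong to the lonely target. You will then have a model that does not cover that target. If that does not meet your requirements, you need a new dataset.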
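Removing the samples of under-represented classes might look roughly like this. It is a sketch under assumptions: `drop_rare_classes`, `toy_rows` and `toy_labels` are hypothetical names, the data is invented, and plain Python lists are used instead of whatever array type `train_x` and `place_id` actually have.

```python
from collections import Counter


def drop_rare_classes(rows, labels, min_count=3):
    """Keep only the samples whose class occurs at least min_count times."""
    counts = Counter(labels)
    kept = [(row, label) for row, label in zip(rows, labels)
            if counts[label] >= min_count]
    kept_rows = [row for row, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_rows, kept_labels


# Toy data (illustrative only): class 9 has a single sample.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]
toy_labels = [7, 7, 7, 9, 7]

filtered_rows, filtered_labels = drop_rare_classes(toy_rows, toy_labels, min_count=3)
print(len(filtered_rows), sorted(set(filtered_labels)))
```

The filtered data would then take the place of `train_x` and `place_id` in the call to `grid_search.fit`. As noted above, the resulting model can never predict the classes that were dropped.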

python

scikit-learn

cross-validation

xgboost