Building ML classifier with imbalanced data

Question

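I have a dataset with 1400 observations and 19 columns. The target variable takes the values 1 (the value I am most interested in) and 0. The class distribution is imbalanced (70:30).

Using the code below I get strange values (all 1s). I am not sure whether this is due to overfitting, to the imbalanced data, or to feature selection (I used Pearson correlation, since all values are numeric/boolean). I think the steps I followed are wrong.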

import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label',axis=1)

def create_cv(X,y):
    if type(X)!=np.ndarray:
        X=X.values
        y=y.values
 
    test_size=1/5
    proportion_of_true=y[y==1].shape[0]/y.shape[0]
    num_test_samples=math.ceil(y.shape[0]*test_size)
    num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
    num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
    
    y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
    y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])

    X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
    X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
    return X_train,X_test,y_train,y_test

X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
    
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)       

y_predict_test = tree.predict(X_test)

print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)

Output:

     precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94

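Has anyone run into similar problems when building a classifier on imbalanced data, using CV and/or undersampling? I am happy to share the whole dataset in case you want to replicate the output. I would appreciate clear answers that show me the steps and what I am doing wrong.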

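I know that there are several methods for reducing overfitting and handling imbalanced data, such as random sampling (over/under), SMOTE, and CV. My idea is to:

Split the data into train/test sets, taking the imbalance into account
Perform CV on the training set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the training set and train the classifier
Estimate the performance (f1-score) on the untouched test set

as outlined in this question.

I think the steps above make sense, but I would be happy to receive any feedback on them.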

Answer 1

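Cross-validation or held-out set

First of all, you are not doing cross-validation. You are splitting your data into train/validation/test sets, which is good, and often sufficient when the number of training samples is large (say, >2e4). When the number of samples is small, as in your case, cross-validation becomes useful.

It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the rest of the training data into, for example, 3 splits. Then, for i in {1, 2, 3}: train on the splits j != i and evaluate on split i. The documentation explains this with prettier, more colorful figures, and you should have a look! It can be quite cumbersome to implement yourself, but fortunately scikit-learn does it out of the box.

As for the dataset being imbalanced, it is a good idea to keep the same ratio of labels in each set. But again, you can let scikit-learn handle that for you!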
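
As a rough illustration of the two points above, here is a minimal sketch of stratified 3-fold cross-validation. It assumes a synthetic dataset from make_classification as a stand-in for the asker's data (the names X_demo and y_demo are made up for this example), and in practice it would be applied to whatever remains after the test set has been held out.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 1400 rows, 18 features, roughly 70% positives.
X_demo, y_demo = make_classification(
    n_samples=1400, n_features=18, weights=[0.3, 0.7], random_state=0
)

# Each fold keeps roughly the same label ratio as the full data.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_positive_rates = [
    y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)
]

# For each fold i: train on the other folds, evaluate on fold i.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_demo, y_demo, cv=skf, scoring="f1",
)
print(fold_positive_rates)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a feel for how much a single train/test split can mislead on a dataset of this size.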

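Purpose

Also, the purpose of cross-validation is to pick the right values for the hyperparameters. You want the right amount of regularization: not too much (underfitting) and not too little (overfitting). If you are using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right quantity to look at to gauge the regularization of your method.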
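
To make the underfitting/overfitting trade-off concrete, the sketch below compares training and cross-validated f1 scores over a range of max_depth values using scikit-learn's validation_curve. The dataset is synthetic (an assumption, with some label noise added through flip_y), and the list of depths is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1400, n_features=18, weights=[0.3, 0.7], flip_y=0.1, random_state=0
)

depths = [1, 2, 3, 5, 8, 12, 20]
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="f1",
)

# A growing gap between the two columns is the usual sign of overfitting.
for depth, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"max_depth={depth:>2}  train f1={tr:.3f}  cv f1={cv:.3f}")
```

A very shallow tree tends to score poorly on both columns, while a very deep one keeps improving on the training folds only.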

Conclusion

Simply use GridSearchCV. It takes care of both the cross-validation and the label balance across folds.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a stratified test set (stratify takes the labels, not a boolean)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
# For a classifier, an integer cv uses StratifiedKFold, see documentation
clf = GridSearchCV(tree, parameters, cv=5)
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())

You can also replace the cv argument with a more advanced splitter, such as StratifiedGroupKFold (no intersection between groups).

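I would also suggest looking at ensembles of randomized trees such as random forests, which are less interpretable but are said to perform better in practice.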

Answer 2

When your data is imbalanced you have to perform stratification. The usual approach is to oversample the class that has fewer values.
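
For illustration, random oversampling of the minority class might look like the sketch below. It uses sklearn.utils.resample on synthetic data (an assumption) and resamples the training part only, so the test set keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_demo, y_demo = make_classification(
    n_samples=1400, n_features=18, weights=[0.3, 0.7], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Find the minority label in the training part.
minority = 0 if (y_tr == 0).sum() < (y_tr == 1).sum() else 1
X_min, X_maj = X_tr[y_tr == minority], X_tr[y_tr != minority]

# Draw minority rows with replacement until both classes have the same size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate(
    [np.full(len(X_maj), 1 - minority), np.full(len(X_min_up), minority)]
)
print(np.bincount(y_tr), np.bincount(y_bal))
```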

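Another option is to train your algorithm with less data. If you have a good dataset, that should not be a problem. In this case, you first take samples from the less represented class, and then use the size of that set to compute how many samples to take from the other class.

This code may help you split your dataset that way: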

import random

import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)

    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))

    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]

    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""

    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]

    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]

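    # DataFrame.append was removed in pandas 2.0, so concatenate and then shuffle
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test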

Usage:

train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)

You can now perform cross-validation on train_ds and evaluate your model on test_ds.

Answer 3

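Your implementation of the stratified train/test split is not optimal, because it lacks randomness. Data often arrives in batches, so taking sequences of rows as they are, without shuffling, is not good practice.
As @sturgemeister mentioned, a class ratio of 3:7 is not critical, so you should not worry too much about class imbalance. If you artificially change the class balance in training, you will need to compensate for it by multiplying by the prior for some algorithms.
As for your "perfect" results, either your model is overtrained or it really does classify the data perfectly. Use different train/test splits to check this.
Another point: your test set has only 94 data points. That is definitely not 1/5 of 1400. Check your numbers.
To get realistic estimates you need a lot of test data. This is why you need to apply a cross-validation strategy.
As a general strategy for 5-fold CV, I suggest the following: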
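1. Split your data into 5 folds with respect to the labels (this is called a stratified split; in scikit-learn, StratifiedKFold gives 5 non-overlapping folds, whereas StratifiedShuffleSplit draws random stratified splits whose test sets can overlap).
2. Take 4 of the splits and train your model on them. If you want to use under/oversampling, modify the data in these 4 training splits only.
3. Apply the model to the remaining split. Do not under/oversample the data in this test part. This way you get a realistic performance estimate. Save the results.
4. Repeat 2. and 3. for every test split (5 times in total, obviously). Important: do not change the model's parameters (e.g. tree depth) between runs; they should be the same for all splits.
5. You have now tested on all of your data points without ever training on the point being tested. This is the core idea of cross-validation. Concatenate all the saved results and evaluate the performance.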
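
The five steps above can be sketched roughly as follows. This is only one possible reading of them: the data is synthetic, the undersample helper is a hypothetical function written for this example, and StratifiedKFold is used so that every point lands in exactly one test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=1400, n_features=18, weights=[0.3, 0.7], random_state=0
)
rng = np.random.default_rng(0)

def undersample(X, y, rng):
    """Randomly drop rows so every class has as many rows as the smallest one."""
    classes, counts = np.unique(y, return_counts=True)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.min(), replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

oof_pred = np.empty_like(y_demo)  # one out-of-fold prediction per data point
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)       # step 1
for train_idx, test_idx in skf.split(X_demo, y_demo):                 # step 4
    # Step 2: resample the training folds only.
    X_tr, y_tr = undersample(X_demo[train_idx], y_demo[train_idx], rng)
    # Same hyperparameters in every fold.
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    # Step 3: the test fold keeps its original class ratio.
    oof_pred[test_idx] = model.predict(X_demo[test_idx])

# Step 5: evaluate on the concatenated out-of-fold predictions.
overall_f1 = f1_score(y_demo, oof_pred)
print(classification_report(y_demo, oof_pred))
print(overall_f1)
```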

Answer 4

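I just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs. negative classes (it is generally 0.5, but it can be treated as a hyperparameter). The latter consists in weighting the classes to cope with their imbalance. This article was really useful for me in understanding how to deal with imbalanced datasets. In it you can also find cost-sensitive learning explained specifically with decision trees as the model. All the other approaches are reviewed really well too, including adaptive synthetic sampling, informed undersampling, etc.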
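
As a hedged sketch of both ideas, the snippet below uses scikit-learn's class_weight="balanced" for the cost-sensitive part and a simple scan over candidate cut-offs for the thresholding part. The data is synthetic and the threshold grid is an arbitrary choice; the threshold should be tuned on validation data, not on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=1400, n_features=18, weights=[0.3, 0.7], flip_y=0.05, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Cost-sensitive learning: weight each class inversely to its frequency.
model = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
model.fit(X_tr, y_tr)

# Thresholding: treat the cut-off on P(class 1) as a hyperparameter.
proba = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
f1_by_threshold = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_by_threshold))]
print(best_threshold, max(f1_by_threshold))
```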

Answer 5

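There is also a model-level solution: use a model that supports sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).

Example code: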

from catboost import CatBoostClassifier

y = df['Label']
X = df.drop('Label',axis=1)
# scale_pos_weight multiplies the weight of class 1, so use negatives / positives
label_ratio = (y==0).sum() / (y==1).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)

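And so forth. This works because CatBoost treats each sample with a weight, so you can determine the class weights in advance (scale_pos_weight). This is better than downsampling, and is technically equivalent to oversampling (but requires less memory).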
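
To show what per-sample weighting amounts to without requiring CatBoost, here is a small sketch using scikit-learn's compute_sample_weight on toy labels (both the toy data and the switch to scikit-learn are assumptions made for this example). With "balanced" weights, both classes end up carrying the same total weight.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

y_toy = np.array([1] * 7 + [0] * 3)  # 70:30 toy labels
weights = compute_sample_weight("balanced", y_toy)

# "balanced" gives each row n_samples / (n_classes * class_count):
# 10 / (2 * 7) for class 1 and 10 / (2 * 3) for class 0.
total_weight_per_class = {c: weights[y_toy == c].sum() for c in (0, 1)}
print(total_weight_per_class)

# Any estimator whose fit accepts sample_weight can consume these weights.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
weighted_tree = DecisionTreeClassifier(random_state=0)
weighted_tree.fit(X_toy, y_toy, sample_weight=weights)
```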

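Also, a major part of dealing with imbalanced data is making sure your metrics are weighted as well, or at least well-defined, since you may want equal performance (or deliberately skewed performance) across the classes on these metrics.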
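
A small sketch of why the choice of metric matters: on a 70:30 toy label vector (an assumption), a useless model that always predicts the majority class still looks acceptable on plain accuracy and on f1 for class 1, while balanced accuracy and macro-averaged f1 expose it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = np.array([1] * 70 + [0] * 30)
y_always_one = np.ones_like(y_true)  # always predicts the majority class

acc = accuracy_score(y_true, y_always_one)
bal_acc = balanced_accuracy_score(y_true, y_always_one)  # mean of per-class recall
f1_pos = f1_score(y_true, y_always_one)                  # f1 for class 1 only
f1_macro = f1_score(y_true, y_always_one, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}  balanced accuracy={bal_acc:.3f}")
print(f"f1 (class 1)={f1_pos:.3f}  f1 (macro)={f1_macro:.3f}")
```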

If you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure: I am one of the maintainers):

from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
