CV and under sampling on a test fold

I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is the Label. I want to predict the major class. I am trying to reproduce the following steps:

Here is what I have done:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)

I have some doubts about how to perform CV on the training set, apply under sampling on a test fold, under-sample the training set, and then train the classifier. Are you familiar with these steps? If so, I would appreciate your help.

If I do this:

    from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)

    # CV on the training set
    scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
    ypred = cross_val_predict(tree, X_train, y_train, cv=3)

    print(classification_report(y_train, ypred))
    accuracy_score(y_train, ypred)
    confusion_matrix(y_train, ypred)

I get this output:

                  precision    recall  f1-score   support

              -1       0.73      0.99      0.84       291
               1       0.00      0.00      0.00       105

        accuracy                           0.73       396
       macro avg       0.37      0.50      0.42       396
    weighted avg       0.54      0.73      0.62       396

I think I am missing something or doing something wrong in the code above.

Data sample:

Have_0 Have_1 Have_2 Have_letters Label
1        0      1         1         1
0        0      0         1        -1 
1        1      1         1        -1
0        1      0         0         1
1        1      0         0         1
1        0      0         1        -1
1        0      0         0         1

Generally, the best way to create a cross-validation set is to mimic your test data. In your case, if we want to split your data into 3 sets (train, crossv., test), the best way is to create sets that all have the same proportion of true/false labels. That is what I do in the following function.

import numpy as np
import math

X = df[["Have_0", "Have_1", "Have_2", "Have_letters"]]
y = df["Label"]

def create_cv(X, y):
    # Accept either pandas objects or numpy arrays
    if type(X) != np.ndarray:
        X = X.values
        y = y.values

    test_size = 1 / 5
    # In the sample data the labels are 1 (true) and -1 (false)
    proportion_of_true = y[y == 1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)

    y_test = np.concatenate([y[y == -1][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
    y_train = np.concatenate([y[y == -1][num_test_false_labels:], y[y == 1][num_test_true_labels:]])

    X_test = np.concatenate([X[y == -1][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y == -1][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_cv(X, y)
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)

By doing this, we get sets that all have the same proportion of true/false labels; you can check their shapes as sketched below.
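
For example, a quick sanity check (a minimal sketch; the exact shapes depend on your data):

# Hypothetical check: every split should show roughly the same proportion of 1s
for name, (X_part, y_part) in {"train": (X_train, y_train),
                               "crossv": (X_crossv, y_crossv),
                               "test": (X_test, y_test)}.items():
    print(name, X_part.shape, y_part.shape, "proportion of 1s:", (y_part == 1).mean())

As a side note, sklearn's train_test_split can produce the same kind of proportional split via its stratify parameter.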

I assume your test data was not representative because it is too small for this purpose (after splitting a few times there is not much left, since cross-validation splits the dataset even further).

For under sampling and over sampling there is a great library called imbalanced-learn. It also comes with good documentation, such as on under sampling.

Given your sample data:

from io import StringIO
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv(StringIO(
    """
    Have_0 Have_1 Have_2 Have_letters Label
    1        0      1         1         1
    0        0      0         1        -1 
    1        1      1         1        -1
    0        1      0         0         1
    1        1      0         0         1
    1        0      0         1        -1
    1        0      0         0         1
    """
), sep=r'\s+')

y = df['Label']
X = df.drop('Label',axis=1)

Then you can under-sample the training dataset:

# Hypothetical split so that X_train/y_train exist (not shown in the original)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
under_sampler = RandomUnderSampler(random_state=0)
X_train_resampled, y_train_resampled = under_sampler.fit_resample(X_train, y_train)
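
To see the effect, you can count the classes after resampling (a quick hypothetical check; both classes should end up with the same count):

from collections import Counter
# The under sampler drops majority-class samples until the classes are balanced
print(Counter(y_train_resampled))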

You can pass that to the cross-validation. The downside is that it would then be evaluated (as part of the CV) against a balanced dataset. That might still be fine for model selection, though. Alternatively, you can apply the sampling to the training split of each CV fold (as you would have to do for over sampling anyway), as in the sketches below.
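
One convenient way to do the per-fold sampling is imbalanced-learn's Pipeline, which applies the sampler only while fitting, so each CV test fold stays untouched. A minimal sketch, assuming the tree classifier and the split from above:

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Each CV training split is under-sampled; each test fold is left as-is
pipeline = Pipeline([
    ("under_sampler", RandomUnderSampler(random_state=0)),
    ("tree", DecisionTreeClassifier(max_depth=5)),
])
scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="f1")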

If your dataset is small, you might benefit from over sampling instead. In that case you need to be mindful that the over-sampled data is not split afterwards, because that would create data leakage (leading to wrong scores; see imbalanced-learn's common pitfalls). That is easy to avoid with train_test_split, as you can just call the sampler after calling train_test_split. But cross-validation involves further splits (hidden inside cross_val_score), so the over sampling needs to happen after each CV split. You could do that using sklearn's KFold or StratifiedKFold class, for example.

Like this:

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import KFold, cross_val_score

def get_train_sampled_cv_splits(train_test_indices_splits, sampler, y):
    # Resample only the train indices of each fold; leave the test fold untouched
    for train_indices, test_indices in train_test_indices_splits:
        y_train_split = y.iloc[train_indices]
        train_indices_resampled, _ = sampler.fit_resample(train_indices.reshape(-1, 1), y_train_split)
        yield train_indices_resampled.reshape(-1), test_indices

over_sampler = RandomOverSampler(random_state=0)
kf = KFold(n_splits=2, shuffle=True, random_state=42)
resampled_train_test_indices_splits = get_train_sampled_cv_splits(
    kf.split(X_train, y_train),
    over_sampler,
    y_train  # the fold indices refer to positions within the training set
)
cross_val_score(tree, X_train, y_train, cv=resampled_train_test_indices_splits, scoring="f1")

You will also want to watch out for the metrics you use with imbalanced datasets (accuracy is usually not a good one). Someone shared a chart on Kaggle that might be useful: Evaluation Metrics for Imbalanced Classification.
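
For example, scikit-learn ships metrics that behave better under imbalance, such as balanced accuracy and the F1 score (a minimal sketch, reusing ypred from the CV predictions above):

from sklearn.metrics import balanced_accuracy_score, f1_score

# Balanced accuracy averages recall over both classes; F1 balances precision and recall
print("balanced accuracy:", balanced_accuracy_score(y_train, ypred))
print("F1:", f1_score(y_train, ypred))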