
All probability values are less than 0.5 on unseen data

I have 15 features with a binary response variable, and I am interested in predicted probabilities rather than the 0-or-1 class labels. When I train and test an RF model with 500 trees, CV, balanced class weights, and balanced samples in the data frame, I get very good accuracy and a very good Brier score. As you can see in the image, the predicted probability values for class 1 on the test data range between 0 and 1.

Here is the histogram of predicted probabilities on the test data:

Most values lie between 0 and 0.2 or between 0.9 and 1, which is quite accurate. However, when I try to predict probabilities on unseen data, i.e., data points whose 0/1 values are all unknown, the predicted probabilities for class 1 fall only between 0 and 0.5. Why is that? Shouldn't these values also fall between 0.5 and 1?

Here is the histogram of predicted probabilities on the unseen data:

I am using sklearn's RandomForestClassifier in Python. The code is below:

import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn import metrics

# Read the CSV
df = pd.read_csv('path/df_all.csv')

# Change the types of the variables as needed
df = df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif': 'category'})

# Response variable: binary 0/1 class labels in the 'probabilities' column
y = df['probabilities']

# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,  # sample with replacement
                                 n_samples=100387,  # to match majority class
                                 random_state=42)  # reproducible results

# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])

y = df1['probabilities']
X = df1.iloc[:, 1:138]

# Cast the integer labels to category
y_01 = y.astype('category')

# Split into training and test sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size=0.30,
                                                      random_state=42, stratify=y)

# Model
model = RandomForestClassifier(n_estimators=500,
                               max_features='sqrt',
                               n_jobs=-1,
                               oob_score=True,
                               bootstrap=True,
                               random_state=0,
                               class_weight='balanced')
# I had 137 variables; to select the optimal subset, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)

# Retrain the model with only the 15 selected variables
rf = RandomForestClassifier(n_estimators=500,
                            max_features='sqrt',
                            n_jobs=-1,
                            oob_score=True,
                            bootstrap=True,
                            random_state=0,
                            class_weight='balanced')

# X1_train/X1_valid are the same data frames, restricted to the 15 variables
# selected by RFECV
X1_train = X_train.loc[:, rfecv.support_]
X1_valid = X_valid.loc[:, rfecv.support_]
rf.fit(X1_train, y_train)

# Print ROC-AUC on the test data
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid, rf.predict(X1_valid)))

# Predicted probabilities of class 1 on training and test data
predt = rf.predict_proba(X1_train)[:, 1]
predv = rf.predict_proba(X1_valid)[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))

# Output:
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487

# Later: I have raster images of those 15 variables; I built a data frame
# (sample_img) from them and used the same function to predict probabilities.

IMG_pred = rf.predict_proba(sample_img)
IMG_pred = IMG_pred[:, 1]

The test results shown are invalid; you have performed an erroneous procedure with two serious consequences that invalidate them.

The mistake is that you perform the minority-class upsampling before splitting into training and test sets; this should not be the case. You should first split into training and test sets, and then perform the upsampling only on the training data, not on the test data.

The first reason such a procedure is invalid is that, this way, some of the duplicates created by the upsampling end up in both the training and the test splits; as a result, the algorithm is tested on samples it has already seen during training, which defeats the most fundamental requirement of a test set (a small sketch after the quote below demonstrates the leak). For more details, see my own answer in a related thread; quoting from there:

I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
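To make the leak concrete, here is a minimal, self-contained sketch with synthetic data and hypothetical column names (not the question's data) that counts how many of the original minority rows end up in both splits when upsampling happens before the split:

import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# Synthetic data: 900 majority (y=0) rows and 100 minority (y=1) rows,
# with a unique id per original row so duplicates can be traced
df = pd.DataFrame({'id': range(1000), 'y': [0] * 900 + [1] * 100})

# Upsample the minority class BEFORE splitting (the question's procedure)
minority_up = resample(df[df['y'] == 1], replace=True,
                       n_samples=900, random_state=42)
df_bal = pd.concat([df[df['y'] == 0], minority_up])

train, test = train_test_split(df_bal, test_size=0.30,
                               random_state=42, stratify=df_bal['y'])

# Minority ids present in BOTH splits: these test rows were seen in training
leaked = set(train.loc[train['y'] == 1, 'id']) & set(test.loc[test['y'] == 1, 'id'])
print(f'{len(leaked)} of the 100 original minority rows appear in both splits')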

The second reason is that this procedure reports biased performance metrics on a test set that no longer represents reality. Remember, we want our test set to be representative of the real unseen data, which of course will be imbalanced; artificially balancing our test set and claiming it achieves X% accuracy, when a great part of that accuracy is due to the artificially upsampled minority class, makes no sense and gives misleading impressions. For details, see my own answer in Balance classes in cross validation (the rationale is identical for the case of a train-test split, as here).
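As a back-of-the-envelope illustration with assumed per-class recalls (not numbers from the question's data), the same model can report a very different overall accuracy depending on the class mix of the test set:

# Suppose leakage inflates minority recall to 0.99 while majority recall is 0.90
recall_majority, recall_minority = 0.90, 0.99

# Overall accuracy is the class-proportion-weighted mean of the per-class recalls
for p_minority, label in [(0.50, 'artificially balanced test set'),
                          (0.10, 'realistic 90/10 test set')]:
    acc = (1 - p_minority) * recall_majority + p_minority * recall_minority
    print(f'{label}: {acc:.3f}')
# Balanced set: 0.945; realistic set: 0.909. The balanced number over-weights
# the (inflated) minority class and does not reflect deployment conditions.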

Notice that this second issue holds even if you had avoided the first error by upsampling the training and test sets separately after splitting: the test set would still be artificially balanced, so the procedure would still be wrong.

In short: you should fix the procedure so that you first split into training and test sets, and only then upsample your training set.
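A minimal sketch of that corrected order, reusing the question's variable names (df, the 'probabilities' column, and the 1:138 feature slice are assumed to exist as in the question):

import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

y = df['probabilities']
X = df.iloc[:, 1:138]

# 1. Split the ORIGINAL, imbalanced data so the test set stays representative
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# 2. Upsample the minority class WITHIN the training split only
train = pd.concat([X_train, y_train], axis=1)
train_majority = train[train['probabilities'] == 0]
train_minority = train[train['probabilities'] == 1]
train_minority_up = resample(train_minority, replace=True,
                             n_samples=len(train_majority), random_state=42)
train_bal = pd.concat([train_majority, train_minority_up])

X_train_bal = train_bal.drop(columns='probabilities')
y_train_bal = train_bal['probabilities']

# 3. Fit on the balanced training data; evaluate on the untouched test set
# (class_weight='balanced' is redundant once the training data is balanced)
rf = RandomForestClassifier(n_estimators=500, max_features='sqrt',
                            n_jobs=-1, oob_score=True, random_state=0)
rf.fit(X_train_bal, y_train_bal)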