Why are the precision and recall values almost the same as the precision and recall of the underrepresented class?

Question

I have a binary classification problem in which one of the classes is almost 0.1 the size of the other class.

I am using sklearn to create a model and evaluate it. I am using these two functions:

print(precision_recall_fscore_support(y_real,y_pred))

out: 
(array([0.99549296, 0.90222222]), # precision of the first class and the second class
 array([0.98770263, 0.96208531]), # recall of the first class and the second class
 array([0.99158249, 0.93119266]), # F1 score of the first class and the second class
 array([1789,  211]))             # instances of the first class and the second class

which returns the precision, recall, fscore and support for each class, and

print(precision_score(y_real,y_pred),recall_score(y_real,y_pred))

out:
0.90222222 , 0.96208531 # precision and recall of the model

which returns the precision and recall of the prediction.

Why do the precision and recall functions return exactly the same values as those of the class with fewer instances (here, the class with 211 instances)?

Answer 1

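This is probably caused by the imbalanced dataset. You can try oversampling the underrepresented class, or undersampling the overrepresented class, depending on the level of variance in your data. I had a similar problem with imbalanced data, and this article helped me: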

Medium Article on imbalanced data
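As a rough illustration of the oversampling idea mentioned above, here is a minimal sketch of random oversampling with replacement, written with the standard library only. The helper name random_oversample, its parameters, and the toy data are hypothetical and not part of sklearn; dedicated libraries offer more careful implementations.

```python
import random


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size.

    Hypothetical helper for illustration; assumes a binary problem.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    if not minority_idx:
        raise ValueError("no samples with the minority label")
    majority_count = len(y) - len(minority_idx)
    # Number of extra minority samples needed (zero if already balanced).
    n_extra = max(0, majority_count - len(minority_idx))
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]
    all_idx = list(range(len(y))) + extra_idx
    return [X[i] for i in all_idx], [y[i] for i in all_idx]


# Toy data with the same class sizes as in the question (1789 vs 211).
X = [[i] for i in range(2000)]
y = [0] * 1789 + [1] * 211

X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))
```

If you go this route, resample only the training split, so that the evaluation data keeps the original class proportions.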

Answer 2

Take a closer look at the documentation of precision_score and recall_score; you will see two arguments: pos_label, with a default value of 1, and average, with a default value of 'binary':

pos_label : str or int, 1 by default

The class to report if average='binary' and the data is binary.

average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

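In other words, as clearly explained in the documentation, these two functions return the precision and recall, respectively, of one class only: the one designated with the label 1.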

From what you show, this class seems to be what you call here the 'second class', and the results are indeed consistent with what you report.
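To make this concrete, here is a small sketch that rebuilds label arrays consistent with the numbers in the question and compares the default call against explicit pos_label and average settings. The per-class counts (1767 and 203 correct, 22 and 8 wrong) are inferred from the reported precision, recall and support; they are an assumption, since the asker's actual y_real and y_pred are not shown.

```python
from sklearn.metrics import (
    precision_recall_fscore_support,
    precision_score,
    recall_score,
)

# Assumed counts, inferred from the reported metrics (not given by the asker):
#   class 0: 1767 predicted as 0, 22 predicted as 1
#   class 1: 8 predicted as 0, 203 predicted as 1
y_real = [0] * 1789 + [1] * 211
y_pred = [0] * 1767 + [1] * 22 + [0] * 8 + [1] * 203

# Per-class results: index 0 is class 0, index 1 is class 1.
per_class_precision, per_class_recall, _, support = precision_recall_fscore_support(
    y_real, y_pred
)

# Defaults are pos_label=1 and average='binary': only class 1 is reported.
default_precision = precision_score(y_real, y_pred)
default_recall = recall_score(y_real, y_pred)

# Asking for class 0 instead.
class0_precision = precision_score(y_real, y_pred, pos_label=0)
class0_recall = recall_score(y_real, y_pred, pos_label=0)

# Unweighted mean over both classes.
macro_precision = precision_score(y_real, y_pred, average="macro")

print(per_class_precision, per_class_recall, support)
print(default_precision, default_recall)
print(class0_precision, class0_recall)
print(macro_precision)
```

With these assumed counts, the default call should match the second entry of each per-class array, and pos_label=0 should match the first. Passing average='macro' or average='weighted' is one way to get a single number that takes both classes into account.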

In contrast, the precision_recall_fscore_support function, according to the docs (emphasis mine):

Compute precision, recall, F-measure and support for each class

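In other words, there is nothing strange or unexpected here; there is no "overall" precision and recall, as by definition they are always computed per class. In practice, in imbalanced binary settings like this one, they are usually computed only for the minority class.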
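The per-class definition can be checked by hand without sklearn. The sketch below uses a hypothetical helper, precision_recall_for, and label arrays whose counts are inferred from the metrics reported in the question (an assumption, since the original data are not shown). For a chosen class, precision is TP / (TP + FP) and recall is TP / (TP + FN), where "positive" means "belongs to that class".

```python
def precision_recall_for(y_true, y_pred, label):
    """Precision and recall when `label` is treated as the positive class.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed counts, inferred from the reported metrics (not given by the asker).
y_real = [0] * 1789 + [1] * 211
y_pred = [0] * 1767 + [1] * 22 + [0] * 8 + [1] * 203

p0, r0 = precision_recall_for(y_real, y_pred, label=0)  # majority class
p1, r1 = precision_recall_for(y_real, y_pred, label=1)  # minority class

print(p0, r0)
print(p1, r1)
```

Changing which class counts as positive changes both numbers, which is why a single call has to pick one class (or an averaging scheme).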


precision

scikit-learn

precision-recall

imbalanced-data