imbalanced-learn: how is the threshold calculated in the instance hardness threshold method?

Question

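I am looking at the source code of the InstanceHardnessThreshold transformer from imbalanced-learn, here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/12b2e0d/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L167

I would like to know how the threshold is calculated. What is the rationale behind it?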

Answer 1

After discussing with the maintainers of the imbalanced-learn package, this is what I learned:

The threshold is determined as follows:

threshold = np.percentile(
    probabilities[y == target_class],
    (1.0 - (n_samples / target_stats[target_class])) * 100.0,
)

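where n_samples is the number of samples from the majority class desired in the final dataset, and target_stats[target_class] is the total number of samples of the majority class (target_class) present in the original dataset.

We need to find a probability threshold such that the number of samples above it matches the number requested in sampling_strategy. By default, that is the number of samples in the minority class, unless the user states otherwise.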
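To make the arithmetic concrete, here is a minimal sketch of the same percentile computation on made-up numbers. The probability values and the counts are assumptions chosen for illustration, and I am assuming samples are kept when their probability is at or above the threshold; this is not the library's code.

```python
import numpy as np

# Hypothetical probabilities of the true class for 10 majority-class samples
# (in the library these come from an estimator; here they are invented).
probabilities = np.array(
    [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
)

n_majority = probabilities.size  # plays the role of target_stats[target_class]
n_samples = 4                    # how many majority samples we want to keep

# Keeping 4 of 10 means discarding the lowest 60%, hence the 60th percentile.
percentile = (1.0 - n_samples / n_majority) * 100.0
threshold = np.percentile(probabilities, percentile)

# Assumption: samples at or above the threshold are retained.
kept = probabilities[probabilities >= threshold]
print(threshold, kept)
```

With these values the 60th percentile falls between 0.55 and 0.65, so exactly the four most confident samples survive.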

Instance hardness is the probability of an observation being misclassified. In other words, it is 1 minus the probability of its class.
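As a small illustration of that definition, the sketch below derives hardness from a class-probability matrix of the kind a classifier's predict_proba returns. The matrix and labels are invented for the example.

```python
import numpy as np

# Hypothetical class probabilities for 4 samples and 2 classes
# (one row per sample, one column per class).
proba = np.array(
    [
        [0.9, 0.1],
        [0.4, 0.6],
        [0.2, 0.8],
        [0.7, 0.3],
    ]
)
y = np.array([0, 0, 1, 1])  # true labels

# Probability assigned to each sample's own (true) class.
p_true = proba[np.arange(len(y)), y]

# Instance hardness: 1 minus the probability of the true class.
hardness = 1.0 - p_true
print(hardness)
```

The second and fourth samples are "hard": the estimator gives their true class a probability below 0.5.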

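The idea is that the probabilities given by the estimator are linked to how certain it is that a sample belongs to a class. Thus, a percentile of 0.0 means that we select all samples, while a percentile of 1.0 (the 100th percentile after the multiplication by 100 in the formula above) means that we select a single sample, the one with the maximum probability. The threshold therefore corresponds to selecting the N samples that most certainly belong to class C, as seen by the estimator. N is defined by the sampling_strategy parameter (e.g. the expected balancing ratio).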
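The two extremes can be checked directly with np.percentile. This is only a sketch on invented probabilities, again assuming that samples at or above the threshold are kept.

```python
import numpy as np

# Hypothetical true-class probabilities for 5 majority-class samples.
probabilities = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# 0th percentile is the minimum, so every sample passes the threshold.
low = np.percentile(probabilities, 0.0)
keep_all = probabilities[probabilities >= low]

# 100th percentile is the maximum, so only the most certain sample passes.
high = np.percentile(probabilities, 100.0)
keep_one = probabilities[probabilities >= high]

print(keep_all.size, keep_one.size)
```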

This method may return more observations than the user requested. This is mentioned in the documentation.
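One plausible way this can happen is tied probabilities at the threshold. The sketch below is my own illustration on invented numbers, not something confirmed by the maintainers, and it assumes a "greater than or equal" comparison against the threshold.

```python
import numpy as np

# Hypothetical true-class probabilities with a three-way tie at 0.5.
probabilities = np.array([0.2, 0.5, 0.5, 0.5, 0.9])

n_majority = probabilities.size
n_samples = 2  # we ask for only 2 samples

threshold = np.percentile(probabilities, (1.0 - n_samples / n_majority) * 100.0)

# All tied samples sit exactly on the threshold, so they are all kept.
kept = probabilities[probabilities >= threshold]
print(threshold, kept.size)
```

Here the threshold lands on the tied value, so four samples pass even though only two were requested.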


imbalanced-data