Create a balanced 1:1 dataset using SMOTE without modifying the majority-class observations in R

Question

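I am working on a binary classification problem with an imbalanced dataset. I want to create a new, more balanced dataset in which each class holds 50% of the observations. To do this, I am using the SMOTE algorithm in R, as provided by the DMwR library.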

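In the new dataset, I want the majority-class observations to stay unchanged.

However, I ran into two problems:

SMOTE reduces or increases the number of majority-class observations (I only want to increase the minority class).
Some of the observations generated by SMOTE contain NA values.

Suppose I have 20 observations: 17 in the majority class and only 3 in the minority class. Here is my code: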

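library(DMwR)
library(dplyr)

# 20 observations of 10 standard-normal columns (X1..X10)
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
# Overwrite X10 with the class label: 17 majority (0), 3 minority (1)
sample_data[1:17, "X10"] <- 0
sample_data[18:20, "X10"] <- 1
# Convert the label to a factor: 1 -> 'Yes', 0 -> 'No'
sample_data[, ncol(sample_data)] <- factor(sample_data[, ncol(sample_data)], levels = c('1', '0'), labels = c('Yes', 'No'))
newDataSet <- SMOTE(X10 ~ ., sample_data, perc.over = 400, perc.under = 100)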

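In my code, I set perc.over = 400 to create 12 new minority-class observations, and perc.under = 100 to leave the majority class unchanged.

However, when I inspect newDataSet, I see that SMOTE reduced the majority class from 17 observations to 12. In addition, some of the generated observations contain NA values.

The result is shown in the figure below: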

Answer 1

According to ?SMOTE:

for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created.

Furthermore:

For instance, if 200 new examples were generated for the minority class, a value of perc.under of 100 will randomly select exactly 200 cases belonging to the majority classes from the original data set to belong to the final data set.

So, in your case, SMOTE:

Creates 12 new Yes cases (in addition to the original ones).
Randomly selects 12 No cases.
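
The arithmetic behind those two steps can be written out as a small worked example. This is only a sketch of the documented formulas in base R, not a call into DMwR, and the variable names are illustrative.

```r
# Class counts from the question
n_minority <- 3    # 'Yes'
n_majority <- 17   # 'No'

perc_over  <- 400
perc_under <- 100

# perc.over / 100 new examples per original minority case
n_synthetic <- n_minority * perc_over / 100

# perc.under is relative to the number of synthetic cases,
# not to the original majority-class size
n_majority_kept <- n_synthetic * perc_under / 100

c(Yes = n_minority + n_synthetic, No = n_majority_kept)
```

Because perc.under is applied to the 12 synthetic cases rather than to the 17 original No rows, the No class ends up with 12 rows instead of 17.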

The new Yes cases that contain NA are probably related to SMOTE's k parameter. According to ?SMOTE:

k: A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class.

Its default value is 5, but your original data contains only 3 Yes cases. Setting k = 2 seems to solve the problem.
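
One plausible mechanism for the NA values, assuming the neighbour lookup sorts the distances to the minority cases and then takes positions 2 to k + 1 (an assumption about the internals, not something stated in ?SMOTE): with only 3 minority cases there are just 2 real neighbours, so the remaining positions index past the end of the vector. The base R sketch below uses made-up distances to show the effect.

```r
# Made-up distances from one minority case to all 3 minority cases
# (the first entry is the distance to itself)
dists <- c(0, 1.2, 0.7)

k <- 5
neighbours_k5 <- order(dists)[2:(k + 1)]
neighbours_k5   # positions past the 3rd element come back as NA

k <- 2
neighbours_k2 <- order(dists)[2:(k + 1)]
neighbours_k2   # both neighbours exist, so there is no NA
```

If a synthetic case is interpolated towards one of the NA neighbours, its feature values would be NA as well, which would match what the question reports.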

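A final remark: to achieve your goal, I would use SMOTE only to increase the number of minority-class observations (perc.over = 400 or 500). You can then combine them with the original observations of the majority class.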
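
A minimal sketch of that last suggestion, assuming DMwR is installed and using data shaped like the example in the question. It keeps only the Yes rows from the SMOTE() output and binds them to the untouched No rows. The seed and the object names smoted, minority, majority and balanced are illustrative choices, not part of the original post.

```r
library(DMwR)

set.seed(1)  # arbitrary seed, only for reproducibility
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
sample_data$X10 <- factor(c(rep("No", 17), rep("Yes", 3)), levels = c("Yes", "No"))

# Oversample with k = 2 because there are only 3 minority cases.
# perc.under still draws a majority sample, but it is discarded below.
smoted <- SMOTE(X10 ~ ., sample_data, perc.over = 500, perc.under = 100, k = 2)

# Keep the minority class only (original + synthetic 'Yes' rows)
minority <- smoted[smoted$X10 == "Yes", ]

# Original majority-class rows, exactly as they were
majority <- sample_data[sample_data$X10 == "No", ]

balanced <- rbind(minority, majority)
table(balanced$X10)
```

By the documented arithmetic, perc.over = 500 should give 15 synthetic plus 3 original Yes rows against the 17 original No rows. If an exact 1:1 split is required, one synthetic row could be dropped at random.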


r

machine-learning

smote

imbalanced-data