Why do I get 'Error in T[, col] <- data[, col]' when I use SMOTE in R?

Question

I have a large forest fire dataset and I want to predict when a fire occurs. Fires are rare: 290 cases out of 620 000.

A tibble: 62,905 x 13
   amplitude polarity DEM_avg   DC   DMC   DSR    FFMC    Pd    RH  TEMP  WS  tree_cover  fire
       <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <fct>
 1     -37.8      0     165.   269.  21.9  0.607  84.0   0    65.1  290. 4.36      8        0
 2     -68.1      0     303.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     34.7     0
 3     -54.3      0     332.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     35.8     1
 4    -108.       0     338.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     30.3     0
 5     -60.3      0     374.   171.  35.7  2.30   88.9   0.3  51.7  295. 4.01      29.6     1
 6     -82.8      0     48.2   133.  18.4  0.210  84.9   0    65.1  289. 1.35      18.7     0
 7     -99.6      0     299.   219.  42.6  2.09   90.8   0    34.2  297. 1.42       7       1
 8     -98.1      0     116.   153.  44.7  0.988  89.0   0    41.3  298. 0.235     32.6     0

I tried to use SMOTE to balance my highly imbalanced dataset, with the changes suggested by StupidWolf. I do the following:

library(readr)
library(tidyverse)
library(caret)
library(DMwR)
data <- read_csv("data/fire2018.csv", 
    col_types = cols(fire = col_factor(levels = c("0", 
        "1"))))
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]
SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 600, perc.under = 100)

However, when I use SMOTE from the DMwR package, I now get the following error:

Error in factor(newCases[, a], levels = 1:nlevels(data[, a]), labels = levels(data[,  : 
  invalid 'labels'; length 0 should be 1 or 2
In addition: Warning messages:
1: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
3: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion

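I have been looking at different solutions. Some people suggest converting the variables to numeric and factor, but my variables are already converted correctly: my dependent variable is a factor with 2 levels, my independent variables are numeric, and none of my variables contain NA. That did not help in my case, though. I still get a similar error.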

Answer 1

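In the example you show, the dependent variable is still numeric; you need to encode it as a factor. The SMOTE function also does not work with a tibble. I could not reproduce the exact error you get, but I suspect it will work if you do what I do below; otherwise, please provide a reproducible example: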

library(DMwR)
library(tibble)
data = iris
data$Species = ifelse(data$Species=="versicolor",1,0)
data = as_tibble(data)  # convert to a tibble to mimic the data in the question

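In the example above, Species is the dependent variable, coded as 0/1. You can see from the structure that the dependent variable is numeric, like yours (see the column type under Species here and under fire in your data):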

head(data)
# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
1          5.1         3.5          1.4         0.2       0
2          4.9         3            1.4         0.2       0
3          4.7         3.2          1.3         0.2       0

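Both of these calls throw errors, the first because Species is numeric and the second because data is still a tibble: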

newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,data=data,perc.over = 100, perc.under = 200)

# convert to factor
data$Species = factor(data$Species)

newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,data=data,perc.over = 100, perc.under = 200)
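
The second call fails even though Species is now a factor. One plausible reason, judging from the `class(data[, col])` warning quoted in the question, is that single-column indexing behaves differently on a tibble than on a data.frame. The sketch below uses a small made-up data set (the names `df` and `tbl` are only for illustration) and does not call DMwR at all:

```r
library(tibble)

# Toy data, purely illustrative
df  <- data.frame(x = c(1.5, 2.5), y = factor(c("0", "1")))
tbl <- as_tibble(df)

# On a data.frame, selecting one column drops to a plain vector
class(df[, "y"])               # "factor"

# On a tibble, the result is still a tibble, so class() has several elements
class(tbl[, "y"])              # "tbl_df" "tbl" "data.frame"

# Wrapping the tibble in data.frame() restores the vector behaviour
class(data.frame(tbl)[, "y"])  # "factor"
```

A multi-element `class()` result is what would make an `if (class(...) %in% ...)` test complain that the condition has length > 1.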

If you do this, it works:

newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,
data=data.frame(data),perc.over = 100, perc.under = 200)

dim(newData)
[1] 200   5
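
The 200 rows can be checked by hand. This is a rough sketch that assumes the usual DMwR reading of the arguments: `perc.over` sets how many synthetic minority cases are generated per original minority case, and `perc.under` sets how many majority cases are sampled per synthetic case. The variable names are made up for the calculation:

```r
n_minority <- 50    # versicolor rows, recoded as 1
perc_over  <- 100
perc_under <- 200

# Synthetic minority cases generated by SMOTE
n_synthetic <- n_minority * perc_over / 100

# Majority cases sampled, relative to the number of synthetic cases
n_majority <- n_synthetic * perc_under / 100

# Original minority + synthetic minority + sampled majority
n_total <- n_minority + n_synthetic + n_majority
n_total   # 200
```

The 5 columns are simply the columns of the input data; the formula only identifies the target.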

Answer 2

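So, after spending a few hours on this problem, I finally found the following solution with StupidWolf's help. I had to clean up my dataset, which contained many other variables that I was not using. Some of those contained NAs. Apparently R cannot handle this, even though I was not using those variables anyway. I also changed the data argument of the SMOTE function to a data.frame. My code ended up like this: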

library(readr)
library(tidyverse)
library(caret)
library(DMwR)
data <- read_csv("data/test.csv", 
    col_types = cols(fire = col_factor(levels = c("0", 
        "1"))))
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]
newData <- SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 10000, perc.under = 1000)
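
A quick way to spot the kind of problem described in this answer is to count missing values in every column, not only the ones in the formula. The sketch below uses base R on a small made-up data frame; `train_df`, `unused`, and `model_df` are hypothetical names, not part of the original data:

```r
# Hypothetical stand-in for train.data; 'unused' is not in the model formula
train_df <- data.frame(
  fire      = factor(c("0", "0", "1", "0")),
  amplitude = c(-37.8, -68.1, -54.3, -108),
  unused    = c(1.2, NA, 3.4, NA)
)

# Count missing values per column, including columns outside the formula
na_counts <- colSums(is.na(train_df))
na_counts

# Keep only the response and the predictors that are actually used
model_df <- train_df[, c("fire", "amplitude")]

# Confirm that nothing is missing before resampling
stopifnot(!anyNA(model_df))
```

Passing only the cleaned columns (as a data.frame) to SMOTE avoids NAs from unused variables reaching the function.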


r

logistic-regression

smote