R-caret-plyr：如何修改downSample函数来创建不同比例的采样数据

Question

下面是我发现here .

的caret的downSample函数

downSample <- function(x, y, list = FALSE, yname = "Class")
  {
    xc <- class(x)
    if(!is.data.frame(x)) x <- as.data.frame(x)
    if(!is.factor(y))
      {
        warning("Down-sampling requires a factor variable as the response. The original data was returned.")
        return(list(x = x, y = y))
      }

    minClass <- min(table(y))
    x$.outcome <- y
    
    x <- ddply(x, .(y),
               function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
               n = minClass)
    y <- x$.outcome
    x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
    if(list)
      {
        if(xc[1] == "matrix") x <- as.matrix(x)
        out <- list(x = x, y = y)
      } else {
        out <- cbind(x, y)
        colnames(out)[ncol(out)] <- yname
      }
    out
  }

假设我的数据集是iris:

data(iris) 
x <- iris[, -5]
y <- iris[, 5]

使响应变量成为一个极度不平衡的二元变量：

y[-c(130, 146)] <- "setosa"

因此现在有两个 "virginica" 实例和 148 个 "setosa" 实例。我想修改函数 downSample，这样，最后，它不会返回具有 50% minClass 的二次采样数据集，而是 returns 具有例如 30% (k) 的二次采样数据集次要 class 和主要 class 的 70%。因为在 minClass 中使用 n 个实例 的 downSample 函数，它选择另一个 class 的 n 个实例 来获得一个完全平衡数据集。但就我而言，我丢失了很多数据，所以我只是想稍微平衡一下。让我们假设 k = 20 % 即最后我想要 minClaas 的 20% 和其他 class 的 80%。我已经尝试修改这部分功能：

x <- ddply(x, .(y), function(dat, n) 
             dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)

通过将 n 更改为 4*n 但我没有实现。有这个错误：

Error in size <= n/2 : comparison (4) is possible only for atomic and list types

我们将不胜感激。

Answer 1

执行此操作的一个简单方法是更改 ddply 调用的 n = minClass 部分。

downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ #add argument frac which is in the 0 - 1 range
  xc <- class(x)
  if(!is.data.frame(x)) x <- as.data.frame(x)
  if(!is.factor(y))
  {
    warning("Down-sampling requires a factor variable as the response. The original data was returned.")
    return(list(x = x, y = y))
  }
  
  minClass <- min(table(y))
  x$.outcome <- y
  
  x <- ddply(x, .(y),
             function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
             n = minClass*frac) #change the n to this
  y <- x$.outcome
  x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
  if(list)
  {
    if(xc[1] == "matrix") x <- as.matrix(x)
    out <- list(x = x, y = y)
  } else {
    out <- cbind(x, y)
    colnames(out)[ncol(out)] <- yname
  }
  out
}

有效吗：

library(plyr)

y 不平衡：

set.seed(1)
y <- as.factor(sample(c("M", "F"),
                      prob = c(0.1, 0.9),
                      size = 10000,
                      replace = TRUE))


x <- rnorm(10000)

table(downSample_custom(x, y)[,2])

输出：

   F    M 
1044 1044 

table(downSample_custom(x, y, frac = 0.5)[,2])

输出：

  F   M 
522 522 

table(downSample_custom(x, y, frac = 0.2)[,2])

产出

  F   M 
208 208

使用 frac > 1 returns 一个错误：

downSample_custom(x, y, frac = 2)

产出

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

编辑：更新问题的答案。

这可以通过分别对每个 class 的索引进行采样来实现。这是一个仅适用于两个 class 问题的示例：

downSample_custom <- function(x, y, yname = "Class", frac = 1){
  lev <- levels(y)
  minClass <- min(table(y))
  lev_min <- levels(y)[which.min(table(y))]
  inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
  inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
  out <- data.frame(x, y)
  out <- out[sort(c(inds_down, inds_minClass)),]
  colnames(out)[ncol(out)] <- yname
  return(out)
}

实际情况：

table(downSample_custom(x, y)[,2])

输出：

   F    M 
1044 1044 

table(downSample_custom(x, y, frac = 5)[,2])

输出：

   F    M 
5220 1044 

head(downSample_custom(x, y, frac = 5))

输出：

           x Class
1 -1.5163733     F
2  0.6291412     F
4  1.1797811     M
5  1.1176545     F
6 -1.2377359     F
7 -1.2301645     M

R-caret-plyr：如何修改downSample函数来创建不同比例的采样数据

R-caret-plyr : how to modify downSample function to create sampled data of different proportions

r

machine-learning

downsampling

plyr

r-caret