R-caret-plyr:如何修改downSample函数来创建不同比例的采样数据
R-caret-plyr : how to modify downSample function to create sampled data of different proportions
下面是我发现here .
的caret
的downSample
函数
downSample <- function(x, y, list = FALSE, yname = "Class")
{
xc <- class(x)
if(!is.data.frame(x)) x <- as.data.frame(x)
if(!is.factor(y))
{
warning("Down-sampling requires a factor variable as the response. The original data was returned.")
return(list(x = x, y = y))
}
minClass <- min(table(y))
x$.outcome <- y
x <- ddply(x, .(y),
function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
n = minClass)
y <- x$.outcome
x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
if(list)
{
if(xc[1] == "matrix") x <- as.matrix(x)
out <- list(x = x, y = y)
} else {
out <- cbind(x, y)
colnames(out)[ncol(out)] <- yname
}
out
}
假设我的数据集是iris
:
data(iris)
x <- iris[, -5]
y <- iris[, 5]
使响应变量成为一个极度不平衡的二元变量:
y[-c(130, 146)] <- "setosa"
因此现在有两个 "virginica"
实例和 148 个 "setosa"
实例。我想修改函数 downSample
,这样,最后,它不会返回具有 50% minClass 的二次采样数据集,而是 returns 具有例如 30% (k) 的二次采样数据集次要 class 和主要 class 的 70%。因为在 minClass 中使用 n 个实例 的 downSample
函数,它选择另一个 class 的 n 个实例 来获得一个完全平衡数据集。但就我而言,我丢失了很多数据,所以我只是想稍微平衡一下。
让我们假设 k = 20 %
即最后我想要 minClaas
的 20% 和其他 class 的 80%。我已经尝试修改这部分功能:
x <- ddply(x, .(y), function(dat, n)
dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)
通过将 n
更改为 4*n
但我没有实现。有这个错误:
Error in size <= n/2 :
comparison (4) is possible only for atomic and list types
我们将不胜感激。
执行此操作的一个简单方法是更改 ddply
调用的 n = minClass
部分。
downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ #add argument frac which is in the 0 - 1 range
xc <- class(x)
if(!is.data.frame(x)) x <- as.data.frame(x)
if(!is.factor(y))
{
warning("Down-sampling requires a factor variable as the response. The original data was returned.")
return(list(x = x, y = y))
}
minClass <- min(table(y))
x$.outcome <- y
x <- ddply(x, .(y),
function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
n = minClass*frac) #change the n to this
y <- x$.outcome
x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
if(list)
{
if(xc[1] == "matrix") x <- as.matrix(x)
out <- list(x = x, y = y)
} else {
out <- cbind(x, y)
colnames(out)[ncol(out)] <- yname
}
out
}
有效吗:
library(plyr)
y 不平衡:
set.seed(1)
y <- as.factor(sample(c("M", "F"),
prob = c(0.1, 0.9),
size = 10000,
replace = TRUE))
x <- rnorm(10000)
table(downSample_custom(x, y)[,2])
输出:
F M
1044 1044
table(downSample_custom(x, y, frac = 0.5)[,2])
输出:
F M
522 522
table(downSample_custom(x, y, frac = 0.2)[,2])
产出
F M
208 208
使用 frac > 1 returns 一个错误:
downSample_custom(x, y, frac = 2)
产出
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
编辑:更新问题的答案。
这可以通过分别对每个 class 的索引进行采样来实现。这是一个仅适用于两个 class 问题的示例:
downSample_custom <- function(x, y, yname = "Class", frac = 1){
lev <- levels(y)
minClass <- min(table(y))
lev_min <- levels(y)[which.min(table(y))]
inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
out <- data.frame(x, y)
out <- out[sort(c(inds_down, inds_minClass)),]
colnames(out)[ncol(out)] <- yname
return(out)
}
实际情况:
table(downSample_custom(x, y)[,2])
输出:
F M
1044 1044
table(downSample_custom(x, y, frac = 5)[,2])
输出:
F M
5220 1044
head(downSample_custom(x, y, frac = 5))
输出:
x Class
1 -1.5163733 F
2 0.6291412 F
4 1.1797811 M
5 1.1176545 F
6 -1.2377359 F
7 -1.2301645 M
下面是我发现here .
的caret
的downSample
函数
downSample <- function(x, y, list = FALSE, yname = "Class")
{
xc <- class(x)
if(!is.data.frame(x)) x <- as.data.frame(x)
if(!is.factor(y))
{
warning("Down-sampling requires a factor variable as the response. The original data was returned.")
return(list(x = x, y = y))
}
minClass <- min(table(y))
x$.outcome <- y
x <- ddply(x, .(y),
function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
n = minClass)
y <- x$.outcome
x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
if(list)
{
if(xc[1] == "matrix") x <- as.matrix(x)
out <- list(x = x, y = y)
} else {
out <- cbind(x, y)
colnames(out)[ncol(out)] <- yname
}
out
}
假设我的数据集是iris
:
data(iris)
x <- iris[, -5]
y <- iris[, 5]
使响应变量成为一个极度不平衡的二元变量:
y[-c(130, 146)] <- "setosa"
因此现在有两个 "virginica"
实例和 148 个 "setosa"
实例。我想修改函数 downSample
,这样,最后,它不会返回具有 50% minClass 的二次采样数据集,而是 returns 具有例如 30% (k) 的二次采样数据集次要 class 和主要 class 的 70%。因为在 minClass 中使用 n 个实例 的 downSample
函数,它选择另一个 class 的 n 个实例 来获得一个完全平衡数据集。但就我而言,我丢失了很多数据,所以我只是想稍微平衡一下。
让我们假设 k = 20 %
即最后我想要 minClaas
的 20% 和其他 class 的 80%。我已经尝试修改这部分功能:
x <- ddply(x, .(y), function(dat, n)
dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)
通过将 n
更改为 4*n
但我没有实现。有这个错误:
Error in size <= n/2 : comparison (4) is possible only for atomic and list types
我们将不胜感激。
执行此操作的一个简单方法是更改 ddply
调用的 n = minClass
部分。
downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ #add argument frac which is in the 0 - 1 range
xc <- class(x)
if(!is.data.frame(x)) x <- as.data.frame(x)
if(!is.factor(y))
{
warning("Down-sampling requires a factor variable as the response. The original data was returned.")
return(list(x = x, y = y))
}
minClass <- min(table(y))
x$.outcome <- y
x <- ddply(x, .(y),
function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
n = minClass*frac) #change the n to this
y <- x$.outcome
x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
if(list)
{
if(xc[1] == "matrix") x <- as.matrix(x)
out <- list(x = x, y = y)
} else {
out <- cbind(x, y)
colnames(out)[ncol(out)] <- yname
}
out
}
有效吗:
library(plyr)
y 不平衡:
set.seed(1)
y <- as.factor(sample(c("M", "F"),
prob = c(0.1, 0.9),
size = 10000,
replace = TRUE))
x <- rnorm(10000)
table(downSample_custom(x, y)[,2])
输出:
F M
1044 1044
table(downSample_custom(x, y, frac = 0.5)[,2])
输出:
F M
522 522
table(downSample_custom(x, y, frac = 0.2)[,2])
产出
F M
208 208
使用 frac > 1 returns 一个错误:
downSample_custom(x, y, frac = 2)
产出
Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'
编辑:更新问题的答案。
这可以通过分别对每个 class 的索引进行采样来实现。这是一个仅适用于两个 class 问题的示例:
downSample_custom <- function(x, y, yname = "Class", frac = 1){
lev <- levels(y)
minClass <- min(table(y))
lev_min <- levels(y)[which.min(table(y))]
inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
out <- data.frame(x, y)
out <- out[sort(c(inds_down, inds_minClass)),]
colnames(out)[ncol(out)] <- yname
return(out)
}
实际情况:
table(downSample_custom(x, y)[,2])
输出:
F M
1044 1044
table(downSample_custom(x, y, frac = 5)[,2])
输出:
F M
5220 1044
head(downSample_custom(x, y, frac = 5))
输出:
x Class
1 -1.5163733 F
2 0.6291412 F
4 1.1797811 M
5 1.1176545 F
6 -1.2377359 F
7 -1.2301645 M