R data.table - sample by group with different sampling proportion
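I would like to efficiently draw a random sample by group from a data.table, but it should be possible to sample a different proportion for each group.
If I wanted to sample the fraction sampling_fraction from each group, I could take inspiration from this question and related answers and do something like: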
library(data.table)

DT = data.table(a = sample(1:2), b = sample(1:1000,20))
group_sampler <- function(data, group_col, sample_fraction){
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}
# what % of data should be sampled
sampling_fraction = 0.5
# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)
But what if I wanted to sample 10% from group 1 and 50% from group 2?
You can use .GRP, but you need to make sure each fraction is matched to the right group. You might want to define group_col as a factor variable.
group_sampler <- function(data, group_col, sample_fractions) {
  # this function samples a group-specific fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column used to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, in sorted group order
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
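A short usage sketch for the function above, assuming the two groups in column a are 1 and 2. Because keyby sorts the groups, .GRP counts them in sorted order, so the first element of sample_fractions applies to the smallest group value. The names DT_demo, group_sampler_grp and fractions are made up for this illustration.

```r
library(data.table)

set.seed(1)  # only to make the illustration reproducible
DT_demo <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))

# same body as the .GRP version above, under a hypothetical name
group_sampler_grp <- function(data, group_col, sample_fractions) {
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[.GRP]))], keyby = group_col]
}

fractions <- c(0.1, 0.5)  # position 1 -> group 1, position 2 -> group 2
sampled <- group_sampler_grp(DT_demo, "a", fractions)

# rows kept per group: ceiling(10 * 0.1) = 1 and ceiling(10 * 0.5) = 5
counts <- sampled[, .N, keyby = a]
print(counts)
```

With unsorted or non-numeric groups the positional match is easy to get wrong, which is why the order of sample_fractions matters here.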
Edit in response to chinsoon12's comment:
To make the last line of the function safer (instead of relying on the right order):
data[, .SD[sample(.N, ceiling(.N*sample_fractions[[unlist(.BY)]]))], keyby = group_col]
Then pass sample_fractions as a named vector:
group_sampler(DT, 'a', sample_fractions= c(x = 0.1, y = 0.9))
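One caveat worth illustrating: `[[` only looks up by name when it is given a character string. If the grouping column is integer (as column a is in the example DT), unlist(.BY) is a number and the lookup falls back to position. A minimal sketch, assuming a single grouping column and name-based matching for any column type, converts the group value with as.character(); group_sampler_named and DT_demo are hypothetical names.

```r
library(data.table)

group_sampler_named <- function(data, group_col, sample_fractions) {
  # every group value must have a matching name in sample_fractions
  stopifnot(all(as.character(unique(data[[group_col]])) %in% names(sample_fractions)))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(unlist(.BY))]]))],
       keyby = group_col]
}

set.seed(42)  # only to make the illustration reproducible
DT_demo <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))

# names are the group values; deliberately listed out of order
sampled <- group_sampler_named(DT_demo, "a", c(`2` = 0.5, `1` = 0.1))
counts <- sampled[, .N, keyby = a]
print(counts)
```

Group 1 still gets 10% and group 2 gets 50%, even though the vector lists group 2 first.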
Here is an option using a lookup table (so it does not rely on the ordering of the vector or of the groups).
library(data.table)
DT = data.table(group = sample(1:2), val = sample(1:1000,20))
sample_props <- data.table(group = 1:2, prop = c(.1,.5))
group_sampler <- function(data, group_col, sample_props){
  # this function samples a group-specific proportion from each group in the data.table
  # inputs:
  #   data - data.table with data
  #   group_col - column(s) used to group by (must be in both data.tables)
  #   sample_props - data.table with sample proportions (column prop)
  # merge the passed-in data (not the global DT) with the lookup table
  ret <- merge(data, sample_props, by = group_col)
  # prop is constant within a group, so its first value gives the sample size
  ret <- ret[,.SD[sample(.N, ceiling(.N*prop[1L]))], eval(group_col)]
  return(ret[,prop := NULL][])
}
# perform the sampling
group_sampler(DT, 'group', sample_props)
#> group val
#> 1: 1 721
#> 2: 2 542
#> 3: 2 680
#> 4: 2 613
#> 5: 2 170
#> 6: 2 175
Created on 2019-10-15 by the reprex package (v0.3.0)
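Since the question asks for efficiency, a variant sometimes used with data.table is to sample row numbers with .I instead of subsetting .SD per group, and then subset the table once. The sketch below is that swapped-in idiom, not the method from the answer above, and whether it is actually faster would need benchmarking on your data. group_sampler_idx and the demo tables are hypothetical names, and the sketch assumes every group in data has a row in sample_props and that data has no column called prop.

```r
library(data.table)

group_sampler_idx <- function(data, group_col, sample_props) {
  # attach each row's proportion via an update join, on a copy of the input
  tmp <- copy(data)
  tmp[sample_props, on = group_col, prop := i.prop]
  # collect sampled row numbers per group (prop is constant within a group)
  idx <- tmp[, .I[sample(.N, ceiling(.N * prop[1L]))], by = group_col]$V1
  # subset the original table once, so the helper column never appears
  data[idx]
}

set.seed(7)  # only to make the illustration reproducible
DT_demo <- data.table(group = rep(1:2, each = 10), val = sample(1:1000, 20))
props_demo <- data.table(group = 1:2, prop = c(0.1, 0.5))

sampled <- group_sampler_idx(DT_demo, "group", props_demo)
counts <- sampled[, .N, keyby = group]
print(counts)
```

As with the lookup-table version, the proportions are matched by group value through the join, so the order of rows in props_demo does not matter.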