How to proportionally split data using initial_split in R

Question

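I want to split my data proportionally. For example, I have 100 rows and I want to randomly sample 1 row out of every two. Using tidymodels rsample, I assumed I would do the following.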

dat <- as_tibble(seq(1:100))

split <- initial_split(dat, prop = 0.5, breaks = 50)

testing <- testing(split)

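When I inspect the data, the split is not what I expected. It looks close, but not exact. I thought the breaks argument would create the bins to sample from, so breaks = 50 would divide the 100 rows into 50 bins of two rows each. I also tried strata = value to stratify across the rows, but I could not get that to work either.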

I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, and so on.

Am I misunderstanding the breaks argument?

Answer 1

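There is an argument that protects users from creating stratified splits that are too small, which is what you are running up against; it is called pool: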

library(rsample)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2))) 
dat
#> # A tibble: 100 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     2 1    
#>  3     3 2    
#>  4     4 2    
#>  5     5 3    
#>  6     6 3    
#>  7     7 4    
#>  8     8 4    
#>  9     9 5    
#> 10    10 5    
#> # … with 90 more rows

split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>

training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     4 2    
#>  3     5 3    
#>  4     8 4    
#>  5    10 5    
#>  6    12 6    
#>  7    13 7    
#>  8    16 8    
#>  9    17 9    
#> 10    20 10   
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     2 1    
#>  2     3 2    
#>  3     6 3    
#>  4     7 4    
#>  5     9 5    
#>  6    11 6    
#>  7    14 7    
#>  8    15 8    
#>  9    18 9    
#> 10    19 10   
#> # … with 40 more rows

Created on 2022-02-22 by the reprex package (v2.0.1)

We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
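For the "1 row out of every four" case from the question, the same idea should carry over. The following is a minimal sketch, not a recommended workflow: it assumes strata of four consecutive rows and prop = 0.25, and the names dat4, split4, and sampled are made up for illustration. It uses a plain data.frame so that only rsample needs to be loaded.

```r
library(rsample)

set.seed(2022)  # arbitrary seed so the draw is reproducible

# Same layout as the example above, but each stratum now holds
# four consecutive rows instead of two (25 strata in total).
dat4 <- data.frame(
  value = 1:100,
  strat = factor(rep(1:25, each = 4))
)

# prop = 0.25 asks for one quarter of each stratum in the analysis set.
# pool = 0 keeps the small strata from being pooled together; expect the
# same "statistically risky" warning as in the example above.
split4 <- initial_split(dat4, prop = 0.25, strata = strat, pool = 0)

# The sampled rows are the analysis (training) portion of the split.
sampled <- training(split4)

# Check how many rows were drawn and whether every stratum contributed one.
nrow(sampled)
all(table(sampled$strat) == 1)
```

As in the two-row example, the rows you want are in training(split4); testing(split4) holds the remaining rows of each group. Note that the rsample documentation describes breaks as the number of bins for a numeric stratification variable, so it is not expected to have any effect when strata is not supplied.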

r

train-test-split

tidymodels

rsample