Sample from a grouped dataframe with specified probabilities in R

Question

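Below, I first group my data.frame (d) by two categorical variables: first by gender (2 levels: M/F), then by sector (Education, Industry, NGO, Private, Public). Then I want to sample from the levels of sector with the probabilities c(.2, .3, .3, .1, .1), and from the levels of gender with the probabilities c(.4, .6).

I am using the code below to do this, but without success. Is there a fix?

Please also comment on whether my code, in general, correctly does what I describe.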

d <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/su.csv')

library(tidyverse)

set.seed(1)
(out <- d %>%
  group_by(gender,sector) %>%
  slice_sample(n = 2, weight_by = c(.4, .6, .2, .3, .3, .1, .1))) # `Error:  incorrect number of probabilities`

Answer 1

Well, slice_sample won't do exactly what you are asking, so I suggest using splitstackshape for this. Install and load it as needed:

# install.packages("splitstackshape")
library(splitstackshape)
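For context on the error in the question: per the dplyr documentation, weight_by has to evaluate to one non-negative weight per row of the input, so a 7-element vector of per-level probabilities is rejected. If approximate proportions are enough, a rough sketch of a weighted alternative spreads each cell's target share over the rows in that cell. The toy data frame and the names gender_p, sector_p, cell_size and w are made up for illustration, and because sampling is without replacement the realized proportions only roughly follow the weights:

```r
library(dplyr)

set.seed(1)
# Toy data standing in for d (an assumption, not the real su.csv)
toy <- data.frame(
  id     = 1:200,
  gender = sample(c("F", "M"), 200, replace = TRUE),
  sector = sample(c("Education", "Industry", "NGO", "Private", "Public"),
                  200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Target marginal probabilities, named by level
gender_p <- c(M = .4, F = .6)
sector_p <- c(Education = .2, Industry = .3, NGO = .3, Private = .1, Public = .1)

# One weight per row: the cell's target share divided by the cell's size
weighted <- toy %>%
  add_count(gender, sector, name = "cell_size") %>%
  mutate(w = unname(gender_p[as.character(gender)] *
                    sector_p[as.character(sector)]) / cell_size)

# Ungrouped weighted draw; counts per cell are random, not fixed
out <- weighted %>% slice_sample(n = 50, weight_by = w)
```

Unlike the stratified approach in this answer, this does not guarantee an exact count per gender/sector cell.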

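There are shorter and faster ways to specify the table of proportions, but I'll go through it methodically, starting with the desired total sample size, in this case n = 100, and then specifying the percentage for each factor level.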

total_sample <- 100
M_percent <- .4
F_percent <- .6
Education_percent <- .2
Industry_percent <- .3
NGO_percent <- .3
Private_percent <- .1
Public_percent <- .1
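As an aside, the per-cell counts implied by these percentages are just the gender share times the sector share times the total. A minimal base R sketch of building them in one go follows; gender_p, sector_p, grid and sizes are names made up here, and the product assumes gender and sector are to be treated as independent:

```r
total_sample <- 100
gender_p <- c(F = .6, M = .4)
sector_p <- c(Education = .2, Industry = .3, NGO = .3, Private = .1, Public = .1)

# Every gender/sector combination (gender varies fastest)
grid <- expand.grid(gender = names(gender_p),
                    sector = names(sector_p),
                    stringsAsFactors = FALSE)

# Count per cell, rounded to whole rows
sizes <- round(gender_p[grid$gender] * sector_p[grid$sector] * total_sample)

# Name each count "<gender> <sector>", e.g. "F Education"
names(sizes) <- paste(grid$gender, grid$sector)

sizes
```

With other percentages or totals, rounding can make the counts sum to slightly more or less than the total, so it is worth checking sum(sizes).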

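Then we call the function stratified, passing first a vector with the two columns we are grouping on, and then a named vector of the groups with the counts we want, calculated from the percentages above: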

abc <- 
   stratified(indt = d, 
              c("gender", "sector"), 
              c("F Education" = F_percent * Education_percent * total_sample, 
                "M Education" = M_percent * Education_percent * total_sample,
                "F Industry" = F_percent * Industry_percent * total_sample, 
                "M Industry" = M_percent * Industry_percent * total_sample,
                "F NGO" = F_percent * NGO_percent * total_sample, 
                "M NGO" = M_percent * NGO_percent * total_sample,
                "F Private" = F_percent * Private_percent * total_sample, 
                "M Private" = M_percent * Private_percent * total_sample,
                "F Public" = F_percent * Public_percent * total_sample, 
                "M Public" = M_percent * Public_percent * total_sample)
              )

We get back the number of randomly selected rows we asked for:

head(abc, 20)
            fake.name    sector pretest state gender    pre                    email       phone
 1:            Correa Education    1254    TX      F Medium            Correa@...com xxx-xx-1886
 2:        Manzanares Education    1227    CA      F    Low        Manzanares@...com xxx-xx-1539
 3:          el-Daoud Education    1409    CA      F   High          el-Daoud@...com xxx-xx-9972
 4:            Engman Education    1436    CA      F   High            Engman@...com xxx-xx-9446
 5:           el-Kaba Education    1305    NY      F Medium           el-Kaba@...com xxx-xx-7060
 6:           Herrera Education    1405    NY      F   High           Herrera@...com xxx-xx-9146
 7:           el-Sham Education    1286    TX      F Medium           el-Sham@...com xxx-xx-4046
 8:          Harrison Education    1112    NY      F    Low          Harrison@...com xxx-xx-3118
 9:               Zhu Education    1055    CA      F    Low               Zhu@...com xxx-xx-6223
10:  Deguzman Gransee Education    1312    TX      F Medium  Deguzman Gransee@...com xxx-xx-5676
11:           Kearney Education    1303    NY      F Medium           Kearney@...com xxx-xx-5145
12: Hernandez Mendoza Education    1139    CA      F    Low Hernandez Mendoza@...com xxx-xx-9642
13:            Barros Education    1416    NY      M   High            Barros@...com xxx-xx-2455
14:            Torres Education    1370    CA      M   High            Torres@...com xxx-xx-2129
15:              King Education    1346    CA      M Medium              King@...com xxx-xx-5351
16:           Cabrera Education    1188    NY      M    Low           Cabrera@...com xxx-xx-6349
17:               Lee Education    1208    CA      M    Low               Lee@...com xxx-xx-7713
18:            Vernon Education    1216    TX      M    Low            Vernon@...com xxx-xx-7649
19:       Ripoll-Bunn Education    1419    TX      M   High       Ripoll-Bunn@...com xxx-xx-8126
20:             Ashby Education    1295    TX      M Medium             Ashby@...com xxx-xx-8416
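To make the idea concrete without the package, here is a base R sketch of the same kind of per-cell sampling, followed by a check that each cell received the requested count. The toy data frame and the names cell, picked and sampled are assumptions for illustration; this is not how stratified is implemented internally:

```r
set.seed(1)
# Toy data: 10 gender/sector cells with 40 rows each (not the real su.csv)
toy <- data.frame(
  id     = 1:400,
  gender = rep(c("F", "M"), each = 200),
  sector = rep(c("Education", "Industry", "NGO", "Private", "Public"), times = 80),
  stringsAsFactors = FALSE
)

# Requested rows per cell, same numbers as the percentages above imply
sizes <- c("F Education" = 12, "M Education" = 8,
           "F Industry"  = 18, "M Industry"  = 12,
           "F NGO"       = 18, "M NGO"       = 12,
           "F Private"   = 6,  "M Private"   = 4,
           "F Public"    = 6,  "M Public"    = 4)

# Label each row with its cell, then draw the requested rows per cell
cell <- paste(toy$gender, toy$sector)
picked <- unlist(lapply(names(sizes), function(k) {
  rows <- which(cell == k)
  rows[sample.int(length(rows), sizes[[k]])]  # without replacement
}))
sampled <- toy[picked, ]

# Cross-tabulate to confirm the per-cell counts
table(sampled$gender, sampled$sector)
```

The same kind of cross-tabulation, for example table(abc$gender, abc$sector), should show whether the stratified result matches the requested counts.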
