Sample from a grouped dataframe with specified probabilities in R
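Below, I first group my data.frame (d) by two categorical variables: first by gender (2 levels: M/F), then by sector (Education, Industry, NGO, Private, Public). I then want to sample from each level of sector with probabilities c(.2, .3, .3, .1, .1) and from each level of gender with probabilities c(.4, .6).
I am using the code below to achieve this, but without success. Is there a fix?
Comments on whether my code, overall, correctly does what I describe are also welcome.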
d <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/su.csv')
library(tidyverse)
set.seed(1)
(out <- d %>%
   group_by(gender, sector) %>%
   slice_sample(n = 2, weight_by = c(.4, .6, .2, .3, .3, .1, .1))) # `Error: incorrect number of probabilities`
Well, slice_sample won't do exactly what you're asking, so I suggest using splitstackshape for the job. Install and load it as needed:
# install.packages("splitstackshape")
library(splitstackshape)
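Some context on the error in the question, as far as I can tell from the dplyr documentation: weight_by expects one weight per row of the input (per row of each group, when the data is grouped), not one weight per factor level, so a vector of 7 level probabilities cannot match. The message quoted in the question is the one base R's sample() raises in that situation, and it can be reproduced without any packages. The values below are made up for illustration and are not the question's data.

```r
# Base R illustration with toy values:
# `prob` must have exactly one entry per element being sampled.
rows_in_group <- 1:5                       # pretend one group has 5 rows
weights <- c(.4, .6, .2, .3, .3, .1, .1)   # 7 weights, as in the question

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    sample(rows_in_group, size = 2, prob = weights)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

In other words, the level probabilities have to be turned into something per cell or per row before sampling, which is what the steps that follow do.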
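There are shorter and quicker ways to specify the proportion table, but I'll go step by step, starting from the desired total sample size, which in this case is n = 100.
Then we specify the percentage for each factor level.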
total_sample <- 100
M_percent <- .4
F_percent <- .6
Education_percent <- .2
Industry_percent <- .3
NGO_percent <- .3
Private_percent <- .1
Public_percent <- .1
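As a quick check on the arithmetic, and as one possible version of the shorter route mentioned above, the ten cell sizes can be computed as an outer product of the two sets of proportions. This is only a sketch: it assumes the cell share is simply gender share times sector share, and the names gender_p, sector_p, cell_n and sizes are mine, not part of splitstackshape.

```r
total_sample <- 100
gender_p <- c(F = .6, M = .4)
sector_p <- c(Education = .2, Industry = .3, NGO = .3, Private = .1, Public = .1)

# Cell share = gender share * sector share; scale to counts and
# round away floating-point noise.
cell_n <- round(outer(gender_p, sector_p) * total_sample)

# Flatten to a named vector whose names look like "F Education"
sizes <- setNames(
  as.vector(cell_n),
  as.vector(outer(names(gender_p), names(sector_p), paste))
)

print(cell_n)
print(sum(sizes))
```

A named vector like sizes has the same shape as the hand-written size vector used with stratified, so it could presumably be passed in its place.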
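Then we call the function stratified, first with a vector of the two columns we're operating on, and then with a named vector of the groups and the number of rows we want from each, calculated from the percentages above.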
abc <-
  stratified(indt = d,
             c("gender", "sector"),
             c("F Education" = F_percent * Education_percent * total_sample,
               "M Education" = M_percent * Education_percent * total_sample,
               "F Industry" = F_percent * Industry_percent * total_sample,
               "M Industry" = M_percent * Industry_percent * total_sample,
               "F NGO" = F_percent * NGO_percent * total_sample,
               "M NGO" = M_percent * NGO_percent * total_sample,
               "F Private" = F_percent * Private_percent * total_sample,
               "M Private" = M_percent * Private_percent * total_sample,
               "F Public" = F_percent * Public_percent * total_sample,
               "M Public" = M_percent * Public_percent * total_sample)
  )
We get back the number of randomly selected rows we asked for:
head(abc, 20)
fake.name sector pretest state gender pre email phone
1: Correa Education 1254 TX F Medium Correa@...com xxx-xx-1886
2: Manzanares Education 1227 CA F Low Manzanares@...com xxx-xx-1539
3: el-Daoud Education 1409 CA F High el-Daoud@...com xxx-xx-9972
4: Engman Education 1436 CA F High Engman@...com xxx-xx-9446
5: el-Kaba Education 1305 NY F Medium el-Kaba@...com xxx-xx-7060
6: Herrera Education 1405 NY F High Herrera@...com xxx-xx-9146
7: el-Sham Education 1286 TX F Medium el-Sham@...com xxx-xx-4046
8: Harrison Education 1112 NY F Low Harrison@...com xxx-xx-3118
9: Zhu Education 1055 CA F Low Zhu@...com xxx-xx-6223
10: Deguzman Gransee Education 1312 TX F Medium Deguzman Gransee@...com xxx-xx-5676
11: Kearney Education 1303 NY F Medium Kearney@...com xxx-xx-5145
12: Hernandez Mendoza Education 1139 CA F Low Hernandez Mendoza@...com xxx-xx-9642
13: Barros Education 1416 NY M High Barros@...com xxx-xx-2455
14: Torres Education 1370 CA M High Torres@...com xxx-xx-2129
15: King Education 1346 CA M Medium King@...com xxx-xx-5351
16: Cabrera Education 1188 NY M Low Cabrera@...com xxx-xx-6349
17: Lee Education 1208 CA M Low Lee@...com xxx-xx-7713
18: Vernon Education 1216 TX M Low Vernon@...com xxx-xx-7649
19: Ripoll-Bunn Education 1419 TX M High Ripoll-Bunn@...com xxx-xx-8126
20: Ashby Education 1295 TX M Medium Ashby@...com xxx-xx-8416
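To confirm that every cell has the requested size rather than just eyeballing the first 20 rows, a cross-tabulation helps. The sketch below is self-contained, so it uses a made-up data frame and a plain base R split-and-sample in place of the real data and the stratified call; toy, wanted, cells, picked and counts are hypothetical names chosen for illustration.

```r
set.seed(1)

# Made-up population: 50 rows in each gender x sector cell
toy <- expand.grid(
  id = 1:50,
  gender = c("F", "M"),
  sector = c("Education", "Industry", "NGO", "Private", "Public"),
  stringsAsFactors = FALSE
)

# Requested rows per cell (the counts the percentages above imply)
wanted <- c(
  "F Education" = 12, "M Education" = 8,
  "F Industry" = 18, "M Industry" = 12,
  "F NGO" = 18, "M NGO" = 12,
  "F Private" = 6, "M Private" = 4,
  "F Public" = 6, "M Public" = 4
)

# Split rows by cell, then draw the requested number without replacement
cells <- split(toy, paste(toy$gender, toy$sector))
picked <- do.call(rbind, lapply(names(cells), function(k) {
  cell <- cells[[k]]
  cell[sample(nrow(cell), wanted[[k]]), ]
}))

# Cross-tabulate to check the achieved counts per cell
counts <- table(picked$gender, picked$sector)
print(counts)
```

With the real result, the equivalent check would be table(abc$gender, abc$sector).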