Stratified Random Sample with a 50% Selection Rate
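I have a question that may be a bit tricky, and I'm not sure whether it is beyond the scope of this site, but I thought I'd give it a shot.
I'm currently working with a dataset that includes a respondent ID (there are 972 respondents), age group, region, race, and gender.
I'm looking for a way to assign each respondent to either "Study 1" or "Study 2" within each demographic variable.
For example, in the dataset below there are 43 males in total. I'm looking for a way to split these males evenly across each variable. If I then filter down to White males aged 13 to 15 from the West, four are left. I'd like to randomly assign the "Study 1" or "Study 2" grouping so that these 4 are split evenly (2 cases go into Study 1 and 2 cases go into Study 2). I want to do the same for every other combination. If the number of cases is odd, I'd like to split them as evenly as possible (so if there are 3 White males aged 7 to 9 from the Midwest, two cases would go into Study 1 and the other into Study 2, or vice versa).
This stratification rule needs to hold for any other combination of filters. Say that 13 of the 972 respondents are Hispanic females aged 7 to 9 from the South: I need to split that subsample so that 7 of them are in Study 1 and the remaining 6 are in Study 2.
I'm not sure whether this is beyond the scope of this forum, but I thought I'd ask the experts.
I tried using the MOD function in Excel, which got me part of the way there, but it doesn't split the sample the way I want.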
data <- read.table(text =
"ID Age Gender Race Region Desired
370 4788 16to18 Male Hispani West Study1
371 4858 4to6 Male Hispani Northeast Study1
372 4863 7to9 Male Hispani South Study1
373 4884 10to12 Female Hispani Northeast Study1
374 4911 4to6 Female Hispani Northeast Study1
375 4967 13to15 Female Hispani West Study1
376 4980 4to6 Male Hispani South Study1
377 5054 13to15 Male Hispani Midwest Study1
378 5074 4to6 Male Hispani Northeast Study2
583 930 4to6 Female White Northeast Study1
584 931 7to9 Male White South Study1
585 937 4to6 Male White South Study1
586 938 10to12 Male White Midwest Study1
587 939 13to15 Male White Northeast Study1
588 941 16to18 Male White West Study1
589 944 10to12 Female White Midwest Study1
590 946 4to6 Male White Midwest Study1
591 949 13to15 Female White West Study1
592 952 16to18 Male White Northeast Study1
593 953 13to15 Female White South Study1
594 959 10to12 Male White Northeast Study1
595 957 10to12 Female White South Study1
596 961 16to18 Female White Midwest Study1
597 963 13to15 Male White South Study1
598 965 7to9 Male White Midwest Study1
599 971 13to15 Female White West Study2
600 976 13to15 Male White South Study2
601 982 16to18 Female White Midwest Study2
602 983 10to12 Female White Northeast Study1
603 986 13to15 Male White West Study1
604 992 10to12 Female White West Study1
605 994 4to6 Female White Midwest Study1
606 997 13to15 Male White West Study2
607 999 10to12 Male White South Study1
608 1013 10to12 Male White West Study1
609 1011 4to6 Female White Northeast Study2
610 1016 7to9 Female White West Study2
611 1022 16to18 Male White South Study1
612 1023 7to9 Male White Northeast Study1
613 1026 16to18 Female White West Study1
614 1027 7to9 Male White West Study1
615 1030 4to6 Male White Northeast Study1
616 1033 10to12 Female White Midwest Study2
617 1034 13to15 Male White Midwest Study1
618 1036 7to9 Female White West Study1
619 1039 16to18 Female White Northeast Study1
620 1042 16to18 Female White West Study2
621 1044 10to12 Female White South Study2
622 1049 13to15 Female White Northeast Study1
623 1050 4to6 Female White South Study1
624 1051 7to9 Male White South Study2
625 1052 13to15 Male White Northeast Study2
626 1053 10to12 Male White South Study2
627 1054 13to15 Male White West Study1
628 1055 7to9 Female White South Study1
629 1058 10to12 Male White South Study1
630 1061 16to18 Male White Midwest Study1
631 1062 10to12 Male White South Study2
632 1066 7to9 Male White South Study1
633 1067 13to15 Male White South Study1
634 1071 16to18 Male White South Study2
635 1072 16to18 Female White Midwest Study1
636 1074 10to12 Female White South Study1
637 1075 10to12 Female White Northeast Study2
638 1078 16to18 Female White Midwest Study2
639 1080 7to9 Male White South Study2
640 1083 4to6 Female White South Study2
641 1093 7to9 Female White Midwest Study1
642 1097 4to6 Female White West Study1
643 1102 10to12 Male White Midwest Study2
644 1104 13to15 Male White West Study2
645 1105 7to9 Male White Midwest Study2
646 1110 13to15 Male White Northeast Study1
647 1113 7to9 Female White Midwest Study2
648 1119 10to12 Female White West Study2
649 1120 10to12 Male White West Study2
650 1122 13to15 Female White West Study1
651 1124 16to18 Female White Midwest Study1
721 1384 7to9 Male White South Study1", stringsAsFactors = FALSE, header = TRUE)
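Good question for this forum, and kudos for the reproducible example!
Here is one way to solve this. I highly recommend the tidyverse packages, which have a lot of great functions.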
library(tidyverse) # load the tidyverse library; if you don't have it, install it first

# take your data,
Study1 <- data %>%
  # group by these variables
  group_by(Age, Gender, Race, Region) %>%
  # sample 50 percent of each group
  sample_frac(0.5) %>%
  # extract a vector that corresponds to the IDs of the sampled participants
  pull(ID)

Study1 # These are all participants for study 1

# now, give each person either "Study1" or "Study2"
# If the person's ID is in the vector "Study1", make the value of a new
# variable, "Study", equal to "Study1". If their ID is NOT in that vector,
# then make them part of "Study2".
data <- data %>%
  mutate(Study = ifelse(ID %in% Study1, "Study1", "Study2"))
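One thing worth checking with this approach is how many people a 50% draw takes from strata of different sizes. The sketch below assumes that sample_frac() turns the fraction into a count with R's round() (I believe that is what dplyr does, but please verify on your version). The names n, study1 and study2 are only for illustration.

```r
# Hypothetical stratum sizes 1 to 5
n <- 1:5

# Assumed rule: number drawn for Study 1 = round(0.5 * n).
# R rounds exact halves to the even neighbour, so round(0.5) is 0
# and round(2.5) is 2.
study1 <- round(0.5 * n)

# Everyone not drawn ends up in Study 2
study2 <- n - study1

data.frame(n, study1, study2)
#   n study1 study2
# 1 1      0      1
# 2 2      1      1
# 3 3      2      1
# 4 4      2      2
# 5 5      2      3
```

Under that assumption the two studies never differ by more than 1 within a stratum, but a stratum with a single respondent always lands in Study 2. With four grouping variables many strata are that small, so Study 2 could end up noticeably larger overall.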
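Your sample data is good, but it does not have enough variability to give you several respondents in every combination. That may just be blind luck or a consequence of the sample you provided. Either way, the premise of this answer does not change for the demonstration.
I'm assuming that you do not need an exact match with the Desired column, just an even distribution of Study within each stratum.
I'll use dplyr, since I think it makes clear what each step accomplishes. One could use sample_frac or runif(n()) < 0.5 for this, but neither guarantees an even split. In this implementation I simply put all rows in random order and then deal out 1 and 2 in turn within each group. As a result, the counts for Study 1 and Study 2 never differ by more than 1 within any particular combination of factors.
To demonstrate the n per group, I'll reduce this to two factors: Age and Gender.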
library(dplyr)
set.seed(2) # for reproducibility only, do not include in production code
studies <- 1:2
out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender) %>%
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()
arrange(out, ID)
# # A tibble: 79 x 7
# ID Age Gender Race Region Desired Study
# <int> <chr> <chr> <chr> <chr> <chr> <int>
# 1 930 4to6 Female White Northeast Study1 1
# 2 931 7to9 Male White South Study1 1
# 3 937 4to6 Male White South Study1 2
# 4 938 10to12 Male White Midwest Study1 1
# 5 939 13to15 Male White Northeast Study1 2
# 6 941 16to18 Male White West Study1 1
# 7 944 10to12 Female White Midwest Study1 1
# 8 946 4to6 Male White Midwest Study1 1
# 9 949 13to15 Female White West Study1 2
# 10 952 16to18 Male White Northeast Study1 1
# # ... with 69 more rows
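If you would rather avoid the dplyr dependency, the same shuffle-then-alternate idea can be sketched in base R. This is only an illustration on a made-up data frame (toy, shuffled and out_base are names I'm inventing here), not a drop-in replacement for the code above.

```r
set.seed(1) # for reproducibility of the illustration only

# Hypothetical toy data: 11 respondents, two stratifying variables
toy <- data.frame(
  ID = 1:11,
  Age = c(rep("4to6", 5), rep("7to9", 6)),
  Gender = rep(c("Female", "Male"), length.out = 11),
  stringsAsFactors = FALSE
)

studies <- 1:2

# Step 1: put all rows in random order
shuffled <- toy[sample(nrow(toy)), ]

# Step 2: within each Age x Gender stratum, deal out 1, 2, 1, 2, ...
shuffled$Study <- ave(
  seq_len(nrow(shuffled)), shuffled$Age, shuffled$Gender,
  FUN = function(i) rep(studies, length.out = length(i))
)

# Restore the original order and tabulate studies per stratum
out_base <- shuffled[order(shuffled$ID), ]
counts <- table(interaction(out_base$Age, out_base$Gender), out_base$Study)
counts
```

Because the rows are shuffled before the labels are dealt out, who ends up in which study is random, while the alternation keeps the counts within each stratum at most 1 apart.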
One way to check that this worked is to tabulate it. The original data:
xtabs(~ Gender + Age, data = data)
# Age
# Gender 10to12 13to15 16to18 4to6 7to9
# Female 10 6 8 7 5
# Male 9 12 6 6 10
And the respondents selected for each study, showing an even split between the two studies:
xtabs(~ Study + Age + Gender, data = out)
# , , Gender = Female
# Age
# Study 10to12 13to15 16to18 4to6 7to9
# 1 5 3 4 4 3
# 2 5 3 4 3 2
# , , Gender = Male
# Age
# Study 10to12 13to15 16to18 4to6 7to9
# 1 5 6 3 3 5
# 2 4 6 3 3 5
And to show that within any one stratum, no study has more than 1 respondent more or fewer than the other:
group_by(out, Age, Gender) %>% summarize(differences = diff(range(table(Study))))
# # A tibble: 10 x 3
# # Groups: Age [5]
# Age Gender differences
# <chr> <chr> <int>
# 1 10to12 Female 0
# 2 10to12 Male 1
# 3 13to15 Female 0
# 4 13to15 Male 0
# 5 16to18 Female 0
# 6 16to18 Male 0
# 7 4to6 Female 1
# 8 4to6 Male 0
# 9 7to9 Female 1
# 10 7to9 Male 0
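A small caveat on this check: table() only counts values that actually occur, so a study that received nobody in a stratum is left out of the range. The sketch below shows the difference on a hypothetical stratum of two people split across three studies. Wrapping the column in factor(..., levels = studies) is my suggestion for making the zero counts visible.

```r
# Hypothetical setup: three studies, but this stratum has only two people
studies <- 1:3
study <- c(1, 2)

# Study 3 is absent from the table, so the spread looks like 0
naive_spread <- diff(range(table(study)))

# With explicit levels, study 3 is counted as 0 and the spread is 1
full_spread <- diff(range(table(factor(study, levels = studies))))

c(naive = naive_spread, full = full_spread)
# naive  full
#     0     1
```

The conclusion of the check is unchanged (the spread is still at most 1), but the version with explicit levels reports the true difference for small strata.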
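I repeated this with up to 10 different studies, and the counts for the studies within the same stratum never differed by more than +/- 1.
For an implementation that keeps all four of your factors, you would use: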
out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender, Race, Region) %>% # <--- the only difference
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()
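I should add that this also works for more than two studies (e.g., studies <- 1:3): the combination of sample_n and rep(..., length.out=) ensures that the counts for the studies within each stratum never differ by more than 1.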
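One possible refinement, since rep(studies, length.out = ...) always starts with the first study: in every stratum whose size is not a multiple of the number of studies, the leftover respondents go to the earliest studies. If you want the "or vice versa" from the question, you could randomize the order of the study labels per stratum as well. The following base-R sketch uses made-up strata, and assign_stratum is a hypothetical helper, not something from dplyr.

```r
set.seed(42) # for reproducibility of the illustration only

studies <- 1:3

# Hypothetical strata of sizes 1, 4, 5 and 7
stratum <- rep(c("a", "b", "c", "d"), times = c(1, 4, 5, 7))

# For one stratum: start from a random ordering of the studies, recycle it
# to the stratum size, then shuffle which member gets which label.
# sample.int() with indexing is used instead of sample(x), because
# sample(x) treats a single number x as 1:x.
assign_stratum <- function(i) {
  start <- studies[sample.int(length(studies))]
  labels <- rep(start, length.out = length(i))
  labels[sample.int(length(labels))]
}

study <- ave(seq_along(stratum), stratum, FUN = assign_stratum)

# Counts per stratum and study, including studies with zero members
counts <- table(stratum, factor(study, levels = studies))
spread <- apply(counts, 1, function(x) diff(range(x)))
spread
```

The within-stratum balance is the same as before (counts differ by at most 1), but which study receives the extra respondent now varies from stratum to stratum instead of always favouring study 1.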