Stratified Random Sample with a 50% Selection Rate

Question

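I have a question that may be a bit tricky, and I'm not sure whether it falls outside the scope of this board, but I thought I'd give it a try.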

I'm currently working with a dataset that includes respondent ID (there are 972 of them), age group, region, race, and gender.

I'm looking for a way to assign each respondent to either "Study 1" or "Study 2" within each of the demographic variables.

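So, for example, the dataset below contains 43 males in total. I'm looking for a way to distribute those males evenly across each variable. If I then filter down to White males from the West aged 13 to 15, four remain. I'd like to randomly assign them to "Study 1" or "Study 2" so that these 4 are split evenly (2 cases in Study 1 and 2 cases in Study 2). I'd like to do the same for every other combination. If the number of cases is odd, I'd like them split as evenly as possible (so if there are 3 White males aged 7 to 9 in the Midwest, two cases would go into Study 1 and the other into Study 2, or vice versa).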

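This stratification rule needs to hold for any other combination of filters. Say 13 of the 972 respondents are Hispanic females from the South aged 7 to 9: I need to split that sample so that 7 of them are in Study 1 and the remaining 6 are in Study 2.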

I'm not sure whether this is outside the scope of this forum, but I thought I'd ask the experts.

I tried using the "MOD" function in Excel, which got me part of the way there, but it didn't split the sample the way I wanted.

data <- read.table(text = 
    "ID   Age    Gender     Race    Region        Desired    
370 4788  16to18   Male    Hispani    West          Study1
371 4858  4to6     Male    Hispani    Northeast     Study1
372 4863  7to9     Male    Hispani    South         Study1
373 4884  10to12   Female  Hispani    Northeast     Study1
374 4911  4to6     Female  Hispani    Northeast     Study1
375 4967  13to15   Female  Hispani    West          Study1
376 4980  4to6     Male    Hispani    South         Study1
377 5054  13to15   Male    Hispani    Midwest       Study1
378 5074  4to6     Male    Hispani    Northeast     Study2
583 930   4to6     Female  White      Northeast     Study1
584 931   7to9     Male    White      South         Study1
585 937   4to6     Male    White      South         Study1
586 938   10to12   Male    White      Midwest       Study1
587 939   13to15   Male    White      Northeast     Study1
588 941   16to18   Male    White      West          Study1
589 944   10to12   Female  White      Midwest       Study1
590 946   4to6     Male    White      Midwest       Study1
591 949   13to15   Female  White      West          Study1
592 952   16to18   Male    White      Northeast     Study1
593 953   13to15   Female  White      South         Study1
594 959   10to12   Male    White      Northeast     Study1
595 957   10to12   Female  White      South         Study1
596 961   16to18   Female  White      Midwest       Study1
597 963   13to15   Male    White      South         Study1
598 965   7to9     Male    White      Midwest       Study1
599 971   13to15   Female  White      West          Study2
600 976   13to15   Male    White      South         Study2
601 982   16to18   Female  White      Midwest       Study2
602 983   10to12   Female  White      Northeast     Study1
603 986   13to15   Male    White      West          Study1
604 992   10to12   Female  White      West          Study1
605 994   4to6     Female  White      Midwest       Study1
606 997   13to15   Male    White      West          Study2
607 999   10to12   Male    White      South         Study1
608 1013  10to12   Male    White      West          Study1
609 1011  4to6     Female  White      Northeast     Study2
610 1016  7to9     Female  White      West          Study2
611 1022  16to18   Male    White      South         Study1
612 1023  7to9     Male    White      Northeast     Study1
613 1026  16to18   Female  White      West          Study1
614 1027  7to9     Male    White      West          Study1
615 1030  4to6     Male    White      Northeast     Study1
616 1033  10to12   Female  White      Midwest       Study2
617 1034  13to15   Male    White      Midwest       Study1
618 1036  7to9     Female  White      West          Study1
619 1039  16to18   Female  White      Northeast     Study1
620 1042  16to18   Female  White      West          Study2
621 1044  10to12   Female  White      South         Study2
622 1049  13to15   Female  White      Northeast     Study1
623 1050  4to6     Female  White      South         Study1
624 1051  7to9     Male    White      South         Study2
625 1052  13to15   Male    White      Northeast     Study2
626 1053  10to12   Male    White      South         Study2
627 1054  13to15   Male    White      West          Study1
628 1055  7to9     Female  White      South         Study1
629 1058  10to12   Male    White      South         Study1
630 1061  16to18   Male    White      Midwest       Study1
631 1062  10to12   Male    White      South         Study2
632 1066  7to9     Male    White      South         Study1
633 1067  13to15   Male    White      South         Study1
634 1071  16to18   Male    White      South         Study2
635 1072  16to18   Female  White      Midwest       Study1
636 1074  10to12   Female  White      South         Study1
637 1075  10to12   Female  White      Northeast     Study2
638 1078  16to18   Female  White      Midwest       Study2
639 1080  7to9     Male    White      South         Study2
640 1083  4to6     Female  White      South         Study2
641 1093  7to9     Female  White      Midwest       Study1
642 1097  4to6     Female  White      West          Study1
643 1102  10to12   Male    White      Midwest       Study2
644 1104  13to15   Male    White      West          Study2
645 1105  7to9     Male    White      Midwest       Study2
646 1110  13to15   Male    White      Northeast     Study1
647 1113  7to9     Female  White      Midwest       Study2
648 1119  10to12   Female  White      West          Study2
649 1120  10to12   Male    White      West          Study2
650 1122  13to15   Female  White      West          Study1
651 1124  16to18   Female  White      Midwest       Study1
721 1384  7to9     Male    White      South         Study1", stringsAsFactors = FALSE, header = TRUE)

Answer 1

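Good question for this forum, and kudos for the reproducible example!

Here is one way to solve this. I highly recommend the tidyverse package, which has a lot of great functions.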

library(tidyverse)  # load the tidyverse library, if you don't have it, install it first

# take your data,
Study1 <- data %>% 
  # group by these variables
  group_by(Age, Gender, Race, Region) %>% 
  # sample 50 percent of each group
  sample_frac(0.5) %>% 
  # extract a vector that corresponds to the IDs of the sampled participants.
  pull(ID)

Study1  # These are all participants for study 1

# now, give each person either "Study1" or "Study2"
# If the person's ID is in the vector "Study1", make the value of a new 
# variable, "Study", equal to "Study1". If their ID is NOT in that vector, 
# then make them part of "Study2".

data <- data %>% 
  mutate(Study = ifelse(ID %in% Study1, "Study1", "Study2"))
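
One thing worth checking with this approach is what happens in strata with an odd number of cases. The sketch below is not part of the answer above. It assumes that sample_frac(0.5) sizes each group's sample as round(n * 0.5), which is my understanding of dplyr's behaviour but should be verified against your installed version. Under that assumption, base R's rounding rule (exact halves round to the even number) decides which study receives the extra case:

```r
# Group sizes to try, from a singleton stratum up to seven cases
n <- 1:7

# ASSUMPTION: the sample size per group is round(n * 0.5).
# R rounds exact halves to the nearest even number.
study1 <- round(n * 0.5)
study2 <- n - study1

# One row per group size, showing how many cases each study would get
split_sizes <- data.frame(n, study1, study2)
split_sizes
```

If the assumption holds, a stratum with a single respondent would always contribute 0 cases to Study 1, which matters when grouping by all four factors leaves many very small strata.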

Answer 2

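Your sample data is good, but it doesn't provide enough variability for every combination of factors to contain more than a few cases. That may be just blind luck or an artifact of the sample you provided. Either way, the premise of this answer doesn't change for the demonstration.

I'm assuming you don't need an exact match to the Desired column, just an even distribution of Study within each stratum.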

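I'll use dplyr, since I think it makes clear what each step does. You could use sample_frac or runif(n()) < 0.5 for this, but that doesn't guarantee an even split. In this implementation, I simply shuffle all rows and then assign 1 and 2 in alternation down the rows of each group. As a result, for no combination of factors will the counts for Study 1 and Study 2 ever differ by more than 1.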
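
As a minimal sketch of why alternating after a shuffle is balanced while a per-row coin flip is not, here is the same idea in base R on a made-up data frame. The `toy` data, the `Stratum` column, and the use of `ave()` in place of group_by() plus mutate() are illustrative assumptions, chosen so the example runs without extra packages:

```r
set.seed(42)  # only so the sketch is repeatable

# Hypothetical toy data: 7 respondents in stratum "A", 4 in stratum "B"
toy <- data.frame(
  ID      = 1:11,
  Stratum = c(rep("A", 7), rep("B", 4))
)

# Step 1: shuffle all rows so the order inside each stratum is random
shuffled <- toy[sample(nrow(toy)), ]

# Step 2: within each stratum, hand out 1, 2, 1, 2, ... down the shuffled rows
shuffled$Study <- ave(
  seq_len(nrow(shuffled)), shuffled$Stratum,
  FUN = function(i) rep(1:2, length.out = length(i))
)

# Counts per stratum and study: "A" splits 4/3 and "B" splits 2/2
counts <- table(shuffled$Stratum, shuffled$Study)
counts

# By contrast, an independent coin flip per row carries no such guarantee
coin <- ifelse(runif(nrow(toy)) < 0.5, 1, 2)
table(toy$Stratum, coin)
```

The shuffle decides who ends up in which study; the alternation only fixes how many, so the per-stratum counts are the same for any seed.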

To demonstrate the n per group, I'll simplify to two factors: Age and Gender.

library(dplyr)
set.seed(2) # for reproducibility only, do not include in production code

studies <- 1:2
out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender) %>%
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()

arrange(out, ID)
# # A tibble: 79 x 7
#       ID Age    Gender Race  Region    Desired Study
#    <int> <chr>  <chr>  <chr> <chr>     <chr>   <int>
#  1   930 4to6   Female White Northeast Study1      1
#  2   931 7to9   Male   White South     Study1      1
#  3   937 4to6   Male   White South     Study1      2
#  4   938 10to12 Male   White Midwest   Study1      1
#  5   939 13to15 Male   White Northeast Study1      2
#  6   941 16to18 Male   White West      Study1      1
#  7   944 10to12 Female White Midwest   Study1      1
#  8   946 4to6   Male   White Midwest   Study1      1
#  9   949 13to15 Female White West      Study1      2
# 10   952 16to18 Male   White Northeast Study1      1
# # ... with 69 more rows

One way to check that it worked is to tabulate it. The original data:

xtabs(~ Gender + Age, data = data)
#         Age
# Gender   10to12 13to15 16to18 4to6 7to9
#   Female     10      6      8    7    5
#   Male        9     12      6    6   10

And those selected for each study, showing an even distribution between the two studies:

xtabs(~ Study + Age + Gender, data = out)
# , , Gender = Female
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      3      4    4    3
#     2      5      3      4    3    2
# , , Gender = Male
#      Age
# Study 10to12 13to15 16to18 4to6 7to9
#     1      5      6      3    3    5
#     2      4      6      3    3    5

And to show that no stratum has a difference of more than 1 between studies:

group_by(out, Age, Gender) %>% summarize(differences = diff(range(table(Study))))
# # A tibble: 10 x 3
# # Groups:   Age [5]
#    Age    Gender differences
#    <chr>  <chr>        <int>
#  1 10to12 Female           0
#  2 10to12 Male             1
#  3 13to15 Female           0
#  4 13to15 Male             0
#  5 16to18 Female           0
#  6 16to18 Male             0
#  7 4to6   Female           1
#  8 4to6   Male             0
#  9 7to9   Female           1
# 10 7to9   Male             0

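I repeated this with up to 10 different studies, and the counts for the studies within the same stratum never differed by more than 1.

For your implementation, where you want to keep all four factors, you would use: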

out <- data %>%
  sample_n(n()) %>%
  group_by(Age, Gender, Race, Region) %>%               # <--- the only difference
  mutate(Study = rep(studies, length.out = n())) %>%
  ungroup()

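I should add that this also works for more than two studies (e.g., studies <- 1:3): the combined use of sample_n and rep(..., length.out=) ensures that the counts for the studies within each stratum never differ by more than 1.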
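
One caveat with rep(studies, length.out = n()): whenever a stratum's size is not a multiple of the number of studies, the leftover cases always go to the studies listed first, so study 1 can end up larger overall. A possible variation, which is my own suggestion rather than part of the answer above, is to randomize the order of the studies per stratum. The sketch below uses base R and a made-up `toy` data frame with a hypothetical `Stratum` column:

```r
set.seed(7)  # only so the sketch is repeatable

# Hypothetical toy data: four strata of sizes 7, 5, 4 and 4
toy <- data.frame(
  ID      = 1:20,
  Stratum = rep(c("A", "B", "C", "D"), times = c(7, 5, 4, 4))
)
studies <- 1:3

# Shuffle the rows, as in the approach above
shuffled <- toy[sample(nrow(toy)), ]

# sample(studies) puts the studies in a fresh random order for each stratum,
# so the leftover cases do not always land in study 1
shuffled$Study <- ave(
  seq_len(nrow(shuffled)), shuffled$Stratum,
  FUN = function(i) rep(sample(studies), length.out = length(i))
)

# Largest minus smallest study count within each stratum
counts  <- table(shuffled$Stratum, factor(shuffled$Study, levels = studies))
max_gap <- apply(counts, 1, function(x) max(x) - min(x))
max_gap
```

The within-stratum guarantee is unchanged (no gap above 1), while the study that receives the extra case now varies from stratum to stratum.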
