Two random unique samples of the same pool

Question

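I am trying to get two samples, each with unique elements. That is, the strings in the "first" vector must not appear in the "second" vector. Unfortunately, I keep getting repeated strings and I can't see a way to fix this. I tried using if-else, but it didn't work.

Edit: The final output should be pairs. The same numbers that are in first should be in second. The only thing that differs is the letter. Each letter must appear exactly three times. The reason I don't want repeated elements is that when I create the pairs, I get pairs such as 1_W and 1_W. That can't happen.

The output should look like this: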

first: 12_U, 23_U, 6_U, 8_T, 24_T, 22_T, 7_S, 10_S, 19_S, 21_W, 14_W, 2_W

second: 12_W, 23_W, 6_W, 8_S, 24_S, 22_S, 7_T, 10_T, 19_T, 21_U, 14_U, 2_U

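Edit 2:

I did a poor job of explaining what I need. This code will be used to select the headlines for a study in which I am going to collect data.

Each theme represents a headline about a specific topic, e.g. global warming. There are 24 themes. Each version (U, T, S, W) represents a variation of the true headline (T).

I have a pool of 96 headlines that vary in theme and version. 1_U is version U of theme 1. I want to see which versions participants choose for each pair.

What I need is to:

select 12 themes;
create pairs within the same theme, so that participants can choose between two versions of the same headline.
Participants always need to see 12 pairs (2 versions of the same theme).
I also need to guarantee that they see each version in the same proportion. That is why I created the vectors "first" and "second", which meet this criterion.

But what I get is pairs with repeated versions. So some of the pairs I get are 12_S and 12_S, when they should be 12_S and any other version (12_U, 12_T or 12_W), because it makes no sense for participants to choose between version S of theme 12 and version S of theme 12.

By creating the two vectors, I was able to get what I wanted, except that some pairs contain the same headline.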

themes <- c(1:24)
set.seed(1)
twelve <- sample(themes, 12)
versions <- c('U', 'T', 'S', 'W')

set.seed(14) 
first <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))
second <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))

repeated <- first[first %in% second]

if (is.null(repeated)) {
  print(second) #if there are no elements in the vector "repeated", then print repeated
} else {
  x <- sample(paste(sample(twelve), rep(versions, 3), sep='_')) #otherwise, pick another sample
}

Answer 1

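To ensure you get 2 vectors, first and second, where the themes in first are not present in second, you either need repeated themes in your vectors or you must split the themes using sampling.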

set.seed(1)
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
split_idx <- sample(length(themes), 0.5*length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]

This creates 2 unique samples, which can be verified with

set_1 %in% set_2

which should return a logical vector containing only FALSE entries.
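As a small sketch of how that check could be collapsed into a single TRUE/FALSE, the snippet below repeats the split and then tests for overlap. The names no_overlap and shared are introduced here only for illustration. It also shows why testing an empty subset with is.null(), as the question's code does, is unlikely to behave as intended: an empty subset is a zero-length vector, not NULL.

```r
set.seed(1)
themes <- 1:24
split_idx <- sample(length(themes), 0.5 * length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]

# Collapse the element-wise check into a single TRUE/FALSE
# (no_overlap is an illustrative name, not part of the answer)
no_overlap <- !any(set_1 %in% set_2)
no_overlap

# An empty subset is a zero-length vector, not NULL, so test its length
shared <- set_1[set_1 %in% set_2]
is.null(shared)        # FALSE even though nothing is shared
length(shared) == 0    # TRUE
```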

If you only want 3 of the letters in the final 2 vectors, I suggest the following:

first <- paste(sample(set_1), sample(versions, 3), sep = "_")
second <- paste(sample(set_2), sample(versions, 3), sep = "_")

There is no need to use rep(versions, 3), because R automatically recycles the shorter vector when the lengths differ.
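To illustrate the recycling rule, here is a minimal sketch. The six theme numbers and the names themes_demo and letters_demo are made up for the example. Note that when three letters are recycled over a longer vector, each letter repeats according to the length ratio, so with 12 themes each letter would appear four times rather than three.

```r
# Illustration of recycling in paste(); the theme numbers are arbitrary
themes_demo <- c(5, 9, 2, 7, 11, 4)
letters_demo <- c("U", "T", "S")

# paste() recycles the shorter vector to the length of the longer one
recycled <- paste(themes_demo, letters_demo, sep = "_")

# The same result written out with an explicit rep()
explicit <- paste(themes_demo, rep(letters_demo, 2), sep = "_")

identical(recycled, explicit)

# Count how often each letter occurs after recycling
table(sub("^[0-9]+_", "", recycled))
```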

To get new vectors with different themes that preserve these properties, you have to split the themes into 2 sets again.

Edit 1: In response to the updated question.

To generate a sample of themes:

set.seed(1)
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
theme_sample <- sample(themes, 12)

To make the versions random and different between the two vectors, I came up with the following "hacky" solution.

first_versions <- sample(versions)
while(sum((second_versions <- sample(versions)) == first_versions) != 0){}

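The above creates one sample and then keeps redrawing the second sample until no version matches element-wise. All that remains is to build the final vectors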

first <- paste(theme_sample, first_versions, sep = "_")
second <- paste(theme_sample, second_versions, sep = "_")

as desired.
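The result can be checked against the requirements from the question. The sketch below reruns the steps above (the redraw loop is written with repeat instead of the empty-bodied while, which should be equivalent) and then tests the three properties. The helpers version_of and theme_of are hypothetical names introduced only for this check.

```r
set.seed(1)
themes <- 1:24
versions <- c("U", "T", "S", "W")
theme_sample <- sample(themes, 12)

first_versions <- sample(versions)
# Redraw until no position holds the same letter in both vectors
repeat {
  second_versions <- sample(versions)
  if (!any(second_versions == first_versions)) break
}

# The 4 versions are recycled over the 12 themes
first <- paste(theme_sample, first_versions, sep = "_")
second <- paste(theme_sample, second_versions, sep = "_")

# Hypothetical helpers (not part of the answer) to split the labels
version_of <- function(x) sub("^[0-9]+_", "", x)
theme_of <- function(x) sub("_[A-Z]$", "", x)

any(first == second)                      # FALSE: no identical pair
all(theme_of(first) == theme_of(second))  # TRUE: pairs share a theme
table(version_of(first))                  # each letter appears 3 times
table(version_of(second))                 # each letter appears 3 times
```

Because the four shuffled letters are recycled over twelve themes, every letter occurs exactly three times in each vector, and the element-wise check on the four letters carries over to all twelve pairs.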

Answer 2

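Here is a brute-force approach. I would draw two shuffles of the twelve selected themes (samp_th), and likewise two shuffles of the versions (samp_vs), and repeat until neither matrix contains a duplicate within any row. Next, repeat each row of samp_vs three times, and use Map to paste the corresponding columns together. Wrap it all in a function, samp_fun.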

samp_fun <- \(themes, versions) {
  themes_12 <- sample(themes, 12)
  repeat {
    samp_th <- replicate(2, sample(themes_12))
    samp_vs <- replicate(2, sample(versions))
    if (!any(apply(samp_th, 1, duplicated)) &
        !any(apply(samp_vs, 1, duplicated))) break
  }
  samp_vs <- samp_vs[rep(seq_len(nrow(samp_vs)), each=3), ]
  Map(\(...) paste(..., sep='_'),
      as.data.frame(samp_th), as.data.frame(samp_vs)) |>
    setNames(c('first', 'second'))
}

Usage

themes <- 1:24
versions <- c('U', 'T', 'S', 'W')

set.seed(42)
res <- samp_fun(themes, versions)

Result

This gives a list of the two groups.

res$first
# [1] "4_S"  "15_S" "9_S"  "18_T" "5_T"  "20_T"
# [7] "17_W" "24_W" "8_W"  "7_U"  "1_U"  "10_U"

res$second
# [1] "15_U" "4_U"  "10_U" "8_W"  "7_W"  "24_W"
# [7] "5_S"  "18_S" "1_S"  "17_T" "9_T"  "20_T"

If you want first and second as objects in your workspace, use list2env.

list2env(res, .GlobalEnv)
first
second

Note: R >= 4.1 is used (for the \(x) function shorthand and the native pipe |>).
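For readers on an older R version, a possible rewrite without the 4.1 syntax is sketched below. The name samp_fun_compat is made up here, and the only intended changes are function(...) in place of \(...), a plain nested call in place of |>, and && in the scalar condition. I have not compared its output with the original on every seed, so treat it as a sketch rather than a drop-in replacement.

```r
# Sketch of the same function without \(x) and |>
# (samp_fun_compat is a hypothetical name, not from the answer)
samp_fun_compat <- function(themes, versions) {
  themes_12 <- sample(themes, 12)
  repeat {
    # Two shuffles each, stored as the columns of a matrix
    samp_th <- replicate(2, sample(themes_12))
    samp_vs <- replicate(2, sample(versions))
    # Stop once no row holds the same value in both columns
    if (!any(apply(samp_th, 1, duplicated)) &&
        !any(apply(samp_vs, 1, duplicated))) break
  }
  # Repeat each version row three times to match the 12 themes
  samp_vs <- samp_vs[rep(seq_len(nrow(samp_vs)), each = 3), ]
  out <- Map(function(...) paste(..., sep = "_"),
             as.data.frame(samp_th), as.data.frame(samp_vs))
  setNames(out, c("first", "second"))
}

set.seed(42)
res_compat <- samp_fun_compat(1:24, c("U", "T", "S", "W"))
str(res_compat)
```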

Answer 3

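I think you can make your life easier by sampling the pair of versions (without replacement) and then pasting on the theme value. So we first sample 12 themes, then sapply over that sample and paste each theme onto two of your versions. You get a matrix with 2 rows.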

set.seed(1)

themes <- 1:24
versions <- c("U", "T", "S", "W")

pairs <- sapply(sample(themes, 12), FUN = function(x) paste(x, sample(versions, 2), sep = "_"))

pairs
#      [,1]  [,2]  [,3]  [,4]  [,5]   [,6]   [,7]   [,8]   [,9]  [,10]  [,11]  [,12]
# [1,] "4_T" "7_S" "1_S" "2_U" "11_U" "14_U" "18_T" "22_T" "5_W" "16_U" "10_T" "6_T"
# [2,] "4_W" "7_U" "1_U" "2_W" "11_T" "14_W" "18_W" "22_U" "5_S" "16_S" "10_W" "6_W"

first <- pairs[1, ]
# [1] "4_T"  "7_S"  "1_S"  "2_U"  "11_U" "14_U" "18_T" "22_T" "5_W"  "16_U" "10_T" "6_T" 

second <- pairs[2, ]
# [1] "4_W"  "7_U"  "1_U"  "2_W"  "11_T" "14_W" "18_W" "22_U" "5_S"  "16_S" "10_W" "6_W"
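Because each theme draws its two versions independently, this approach guarantees distinct versions within a pair, but as far as I can tell it does not guarantee that each letter appears exactly three times per vector. A quick tabulation of the output printed above illustrates this. The vectors are copied from that output, and version_of is a hypothetical helper introduced for the check.

```r
# Vectors copied from the output printed above
first <- c("4_T", "7_S", "1_S", "2_U", "11_U", "14_U",
           "18_T", "22_T", "5_W", "16_U", "10_T", "6_T")
second <- c("4_W", "7_U", "1_U", "2_W", "11_T", "14_W",
            "18_W", "22_U", "5_S", "16_S", "10_W", "6_W")

# Hypothetical helper: strip the theme number, keep the letter
version_of <- function(x) sub("^[0-9]+_", "", x)

any(first == second)       # FALSE: no pair repeats a headline
table(version_of(first))   # S: 2, T: 5, U: 4, W: 1
table(version_of(second))  # S: 2, T: 1, U: 3, W: 6
```

If equal proportions per vector matter, the counts would need an extra constraint or a redraw step on top of this approach.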


