Random sampling from a column in a dataframe using filtering information from a different dataframe in R

Question

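This is my first time posting a question here, so if I am not following some guideline, please let me know and I will fix it right away.

Basically, my problem is this: I have two datasets (call them dataset A and dataset B for simplicity) that share a set of common columns, each holding a sociodemographic characteristic of the individual (one observation per row). For each row in dataset A, I need to select a random observation from dataset B whose key sociodemographic variables match. To illustrate, I have prepared a simple example: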

library("dplyr")

A = data.frame(nuts2 = c(1, 1, 2, 2, 3, 3), gender = c(1, 2, 1, 2, 1, 2))
B = data.frame(nuts2 = c(rep(1,10), rep(2,10), rep(3,10)), gender=c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5)))

A <- A[sample(1:nrow(A)), ] %>% mutate(id = seq(1:nrow(A)))
B <- B[sample(1:nrow(B)), ] %>% mutate(id = seq(1:nrow(B)))

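I tried to avoid for loops, since they seem to be considered bad practice in R, so I wrote a function and ran it on every observation using apply. Suppose we want to match each observation in A with the id of a random observation from B that has the same gender and nuts2 values. My code is as follows: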

matching_fun <- function(x) {
  donor <- B %>% filter(gender == A$gender & nuts2 == A$nuts2) %>% sample_n(1) 
  donor_id <- donor$id
  return(donor_id)
}

A$donor_id <- apply(A, 1, matching_fun)

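I expected this to produce a data frame containing everything in A plus an extra column called donor_id, holding the id of a donor sampled at random from the matching sociodemographic group in B.

However, my code does not perform the matching correctly and does not respect the sociodemographic characteristics. Can anyone tell me what I am doing wrong?

Thanks in advance for any support, comments, or critique.

Note: each of my datasets has nearly two million observations, and I will have to run this in several tests, so computational efficiency matters.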

Answer 1

Here is a data.table approach:

library(data.table)
setDT(A); setDT(B)
A[B, on = .(nuts2, gender)][, .(id = i.id[[sample(.N, 1L)]]), by = .(nuts2, gender)]

Output

> set.seed(1L)
> A[B, on = .(nuts2, gender)][, .(id = i.id[[sample(.N, 1L)]]), by = .(nuts2, gender)]
   nuts2 gender id
1:     1      2  1
2:     2      2 16
3:     2      1  4
4:     1      1  6
5:     3      2 25
6:     3      1 22
> A[B, on = .(nuts2, gender)][, .(id = i.id[[sample(.N, 1L)]]), by = .(nuts2, gender)]
   nuts2 gender id
1:     1      2  8
2:     2      2 10
3:     2      1 24
4:     1      1  5
5:     3      2 25
6:     3      1 29
> A[B, on = .(nuts2, gender)][, .(id = i.id[[sample(.N, 1L)]]), by = .(nuts2, gender)]
   nuts2 gender id
1:     1      2  8
2:     2      2  3
3:     2      1  4
4:     1      1 18
5:     3      2 25
6:     3      1 13

Update

I reformatted the code to help you understand the logic behind it.

library(data.table)
setDT(A); setDT(B)
A[
  B, 
  on = .(nuts2, gender)
][
  , .(id = id[[1L]], donor_id = i.id[[sample(.N, 1L)]]), 
  by = .(nuts2, gender)
]

The output looks like this:

   nuts2 gender id donor_id
1:     2      1  1       15
2:     1      2  2        2
3:     3      2  5        8
4:     2      2  3        6
5:     1      1  4        9
6:     3      1  6       22

The code above does essentially the same thing as the dplyr pipeline below, only more efficiently.

left_join(A, B, by = c("nuts2", "gender")) %>%
  rename(id = id.x, donor_id = id.y) %>%
  group_by(nuts2, gender) %>%
  slice_sample(n = 1L)
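To make the pipeline concrete, here is a minimal sketch that runs it on two small hypothetical tables. The names A_small, B_small, and res are illustrative and not part of the original answer; the sketch assumes A_small has one row per (nuts2, gender) group.

```r
library(dplyr)

# Hypothetical toy data: two recipients in A_small, five candidate donors in B_small
A_small <- data.frame(nuts2 = c(1, 2), gender = c(1, 1), id = c(1, 3))
B_small <- data.frame(nuts2 = c(1, 1, 2, 2, 2), gender = 1, id = c(2, 7, 13, 4, 6))

res <- left_join(A_small, B_small, by = c("nuts2", "gender")) %>%
  rename(id = id.x, donor_id = id.y) %>%   # id.x comes from A_small, id.y from B_small
  group_by(nuts2, gender) %>%
  slice_sample(n = 1L) %>%                 # keep one random joined row per group
  ungroup()

print(res)
```

Each run returns one row per group, with donor_id drawn from the ids in B_small that share that group's nuts2 and gender.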

If that is still unclear, see the diagram below:

Let A and B be the dataframes as follows:
       A                   B
nuts2 gender id     nuts2 gender  id
    1      1  1         1      1   2
    2      1  3         1      1   7
                        2      1  13
                        2      1   4
                        2      1   6
What you want to do is:

First, match these groups
       A                   B
nuts2 gender id     nuts2 gender  id
    1      1  1         1      1   2
                        1      1   7
---------------     ----------------
    2      1  3         2      1  13
                        2      1   4
                        2      1   6

Second, slice a sample for each group in B
       A                   B
nuts2 gender id     nuts2 gender  id
    1      1  1         1      1   2
                        1      1   7 <--- RNG gives you this
---------------     ----------------
    2      1  3         2      1  13
                        2      1   4 <--- RNG gives you this
                        2      1   6 

Then, create a new variable in A with the id in B and call it donor_id
       A                            B
nuts2 gender id donor_id     nuts2 gender  id
    1      1  1        7         1      1   2
                                 1      1   7 <--- RNG gives you this
------------------------     ----------------
    2      1  3        4         2      1  13
                                 2      1   4 <--- RNG gives you this
                                 2      1   6 

However, an equivalent but more efficient way is:

First, join A and B on/by nuts2 and gender. In this way we can determine the population that we want to sample from.
       A                     B
+---------------+       +----------------+                       +----------------------+
|nuts2 gender id|       |nuts2 gender  id|                       |nuts2 gender id.A id.B|
|    1      1  1|       |    1      1   2|   by (nuts2, gender)  |    1      1    1    2|
|    2      1  3|   +   |    1      1   7|        =======>       |    1      1    1    7|
+---------------+       |    2      1  13|                       |    2      1    3   13|
                        |    2      1   4|                       |    2      1    3    4|
                        |    2      1   6|                       |    2      1    3    6|
                        +----------------+                       +----------------------+

Then, just slice a sample row within each (nuts2, gender) group and rename id.A and id.B as id and donor_id, respectively.
        A + B
               id donor_id
nuts2 gender XXXX     XXXX
    1      1    1        2
    1      1    1        7 <--- RNG gives you this
--------------------------
    2      1    3       13
    2      1    3        4 <--- RNG gives you this
    2      1    3        6

That is what my code does.

This part joins A and B on each matching (nuts2, gender) group.

A[
  B, 
  on = .(nuts2, gender)
]
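As a small sketch of what this join returns, the snippet below uses the two tables from the diagram above. The names A_small, B_small, and joined are illustrative only.

```r
library(data.table)

# Tables from the diagram: two rows in A_small, five candidate donors in B_small
A_small <- data.table(nuts2 = c(1, 2), gender = c(1, 1), id = c(1L, 3L))
B_small <- data.table(nuts2 = c(1, 1, 2, 2, 2), gender = 1, id = c(2L, 7L, 13L, 4L, 6L))

# One result row per row of B_small, carrying the matching columns of A_small
joined <- A_small[B_small, on = .(nuts2, gender)]

print(names(joined))                        # A_small's id stays "id"; B_small's id arrives as "i.id"
print(joined[, .N, by = .(nuts2, gender)])  # .N counts the candidate donors in each group
```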

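Then we draw one sample row from the .N rows in each (nuts2, gender) group. .N is a special symbol in the data.table package; it gives the number of rows in each group. There is an i.id because both A and B have a column named id, and in the join result data.table prefixes B's id with i., giving i.id. Also, id[[1L]] is enough because id is the same within each group, since A has exactly one row per group in this example.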

A[
  B, 
  on = .(nuts2, gender)
][
  , .(id = id[[1L]], donor_id = i.id[[sample(.N, 1L)]]), 
  by = .(nuts2, gender)
]
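The grouped sampling above returns one row per (nuts2, gender) group, which fits the example because A has one row per group. With data on the scale mentioned in the question, A would presumably hold many rows per group. The following is a minimal sketch, not part of the original answer, of one way to draw a donor for every row in that case. It assumes donors may be reused (sampling with replacement) and that every group in A also exists in B; the names A2, B2, B2s, grp, start, and n are illustrative.

```r
library(data.table)

set.seed(1L)
# Hypothetical toy data: A2 repeats some (nuts2, gender) combinations
A2 <- data.table(nuts2  = c(1L, 1L, 1L, 2L, 2L, 3L),
                 gender = c(1L, 1L, 2L, 1L, 2L, 2L))
A2[, id := .I]
B2 <- data.table(nuts2  = rep(1:3, each = 10L),
                 gender = rep(rep(1:2, each = 5L), times = 3L))
B2 <- B2[sample(.N)][, id := .I]   # shuffle the rows, then number them

# Sort a copy of B2 so each group occupies one contiguous block of rows
B2s <- B2[order(nuts2, gender)]

# For each group: the row where its block starts and how many donors it holds
grp <- B2s[, .(start = .I[1L], n = .N), by = .(nuts2, gender)]

# Copy each group's block position onto the matching rows of A2
A2[grp, on = .(nuts2, gender), c("start", "n") := .(i.start, i.n)]

# One uniform draw per row of A2 picks an offset 0..n-1 inside the group's block
A2[, donor_id := B2s$id[start + floor(runif(.N) * n)]]
A2[, c("start", "n") := NULL]      # drop the helper columns

print(A2)
```

Because the draw is a single vectorised runif call rather than one function call per row, this layout may scale better to millions of rows, though that would need to be checked on the real data.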

Answer 2

If you want to avoid for loops, perhaps an if/else statement would work for you:

library("dplyr")

#set.seed(1)

A = data.frame(nuts2 = c(1, 1, 2, 2, 3, 3), gender = c(1, 2, 1, 2, 1, 2))
B = data.frame(nuts2 = c(rep(1,10), rep(2,10), rep(3,10)), gender=c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5)))

A <- A[sample(1:nrow(A)), ] %>% mutate(id = seq(1:nrow(A)))
B <- B[sample(1:nrow(B)), ] %>% mutate(id = seq(1:nrow(B)))


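# Proceed only if every gender and nuts2 value in A also occurs in B;
# all() collapses the element-wise checks into the single TRUE/FALSE that if() needs
if (all(is.element(A$gender, B$gender)) && all(is.element(A$nuts2, B$nuts2))) {
  # For each row of A, draw one id from the rows of B with the same nuts2 and gender
  donor_id <- vapply(seq_len(nrow(A)), function(i) {
    pool <- B$id[B$gender == A$gender[i] & B$nuts2 == A$nuts2[i]]
    pool[sample.int(length(pool), 1L)]
  }, integer(1))
  result <- cbind(A, donor_id)[-3]  # drop A's own id column
  print(result)
} else {
  print("No matching characteristics")
}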
