将来自两个数据框的个体配对

Pairing individuals from two dataframes

我有两个数据框。第一个包含来自 10 个个体的数据,而第二个包含 1000 个。数据帧之间的变量相同:

set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
                          "Sex" = sample(c("M", "F")),
                          "Age" = sample(20:30, 10, replace = TRUE),
                          "Smoke" = sample(c("Y","N")),
                          "Disease" = sample(c("Y","N")))

BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
                        "Sex" = sample(c("M", "F")),
                        "Age" = sample(20:30, 1000, replace = TRUE),
                        "Smoke" = sample(c("Y","N")),
                        "Disease" = sample(c("Y","N")))

我的目标是从 BigCohort 中找到成对的个体。基本上,我需要从 BigCohort 中选择 10 个 unique 个人,配对 SexAgeSmokeDisease 来自小队列的人数相同。

我不是在寻找倾向得分匹配。

我试过这个:

library(dplyr)
Combined <- inner_join(SmallCohort, BigCohort)

我昨天刚做了类似的事情!一个简单的循环和一些 tidyverse 函数就可以完成工作。

首先,您没有在示例数据中正确使用 sample,所以我们先更正一下:

set.seed(2022)
SmallCohort <- data.frame("ID" = paste0("S00", sample(1000:9999, 10)),
                          "Sex" = sample(c("M", "F")),
                          "Age" = sample(20:30, 10, replace = TRUE),
                          "Smoke" = sample(c("Y","N"), 10, replace = T),
                          "Disease" = sample(c("Y","N"), 10, replace = T))

BigCohort <- data.frame("Subj" = paste0("B00", sample(1000:9999, 1000)),
                        "Sex" = sample(c("M", "F")),
                        "Age" = sample(20:30, 1000, replace = TRUE),
                        "Smoke" = sample(c("Y","N"), 1000, replace = T),
                        "Disease" = sample(c("Y","N"), 1000, replace = T))

对于 SmallCohort 中的每个人,循环检查 BigCohort 中的哪些人与您选择的属性(性别、年龄、吸烟、疾病)完全匹配。从 BigCohort 中选择第一个符合条件的匹配人员(如果他们尚未匹配)。此循环向 SmallCohort 添加一个名为“BigMatch”的列,该列存储来自 BigCohort 的适当匹配项(这会跟踪匹配项并存储最终结果)。您也可以使用它来提取 BigCohort 匹配列表。

library(tidyverse)

SmallCohort$BigMatch <- NA
for (i in 1:nrow(SmallCohort)) {
  
  small.attributes <- select(SmallCohort[i, ], -ID, -BigMatch)
  
  big_matches <- BigCohort %>% 
    mutate(across(-Subj, ~.x == small.attributes[[cur_column()]])) %>% 
    filter(if_all(-Subj, ~. == T) & !(Subj %in% SmallCohort$BigMatch)) 
  
  SmallCohort$BigMatch[i] <- big_matches$Subj[1]
}

BigCohortMatched <- filter(BigCohort, Subj %in% SmallCohort$BigMatch)

SmallCohort
        ID Sex Age Smoke Disease BigMatch
1  S005323   M  21     Y       Y  B007379
2  S008885   F  26     N       N  B009542
3  S003870   M  25     Y       Y  B008244
4  S009899   F  24     N       N  B004840
5  S005869   M  20     N       N  B009496
6  S003750   F  22     N       N  B005087
7  S009998   M  20     Y       Y  B008961
8  S001475   F  21     Y       N  B006545
9  S001122   M  29     N       Y  B007727
10 S002271   F  26     Y       N  B008913

BigCohortMatched
      Subj Sex Age Smoke Disease
1  B007727   M  29     N       Y
2  B005087   F  22     N       N
3  B009542   F  26     N       N
4  B008961   M  20     Y       Y
5  B008244   M  25     Y       Y
6  B006545   F  21     Y       N
7  B008913   F  26     Y       N
8  B009496   M  20     N       N
9  B004840   F  24     N       N
10 B007379   M  21     Y       Y