How to identify which data belongs to which dataset after merging them in R?

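I have two large datasets, each with more than 3 million observations, and I merged them with full_join on the variable "N_AIH". The catch is that in dataset 1 this variable is called "N_AIH", while in dataset 2 it is called "NUM_AIH". This is how I joined them: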

join_test <- full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)

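I have to keep both variables in the joined dataset, but now I need to identify:

1 - Obs that are in both datasets (matches)
2 - Obs that are in dataset 1 but not in dataset 2
3 - Obs that are in dataset 2 but not in dataset 1

I can't seem to find a way to do this. I need to use the N_AIH/NUM_AIH variables.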

library(tidyverse)

dataset1 <- tibble(N_AIH = seq(5))
dataset1
#> # A tibble: 5 x 1
#>   N_AIH
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
dataset2 <- tibble(NUM_AIH = seq(3))
dataset2
#> # A tibble: 3 x 1
#>   NUM_AIH
#>     <int>
#> 1       1
#> 2       2
#> 3       3

joined <-
  full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
joined
#> # A tibble: 5 x 2
#>   N_AIH NUM_AIH
#>   <int>   <int>
#> 1     1       1
#> 2     2       2
#> 3     3       3
#> 4     4      NA
#> 5     5      NA

# observations present in both datasets
inner_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE) %>%
  distinct(N_AIH, .keep_all = TRUE)
#> # A tibble: 3 x 2
#>   N_AIH NUM_AIH
#>   <int>   <int>
#> 1     1       1
#> 2     2       2
#> 3     3       3

# Obs that were in dataset 1 but weren't in dataset 2
anti_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"))
#> # A tibble: 2 x 1
#>   N_AIH
#>   <int>
#> 1     4
#> 2     5

# Obs that were in dataset 2 but weren't in dataset1
anti_join(dataset2, dataset1, by = c("NUM_AIH" = "N_AIH"))
#> # A tibble: 0 x 1
#> # … with 1 variable: NUM_AIH <int>

Created on 2021-10-04 by the reprex package (v2.0.1)
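
If the goal is to label the rows inside the already joined dataset rather than run three separate joins, one possible approach is to check which of the two kept key columns is NA. The following is only a sketch: the column name `source` and its labels are made up for illustration, and it assumes that neither N_AIH nor NUM_AIH contains NA in the original datasets (otherwise a missing key could not be told apart from a non-match).

```r
library(dplyr)

# Same toy data as in the example above
dataset1 <- tibble(N_AIH = 1:5)
dataset2 <- tibble(NUM_AIH = 1:3)

joined <- full_join(dataset1, dataset2,
                    by = c("N_AIH" = "NUM_AIH"), keep = TRUE)

# Assumption: the key columns have no NA values before the join,
# so an NA after the join means "no match on that side".
# `source` is a hypothetical column name chosen for this sketch.
labelled <- joined %>%
  mutate(
    source = case_when(
      !is.na(N_AIH) & !is.na(NUM_AIH) ~ "both",
      !is.na(N_AIH)                   ~ "dataset1 only",
      TRUE                            ~ "dataset2 only"
    )
  )

# Number of rows per category
count(labelled, source)
```

Filtering on `source` then gives each of the three groups without repeating the join, which may matter with several million rows per dataset.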