How to identify which data belongs to which dataset after merging them in R?
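I have two large datasets, each with more than 3 million observations, and I merged them with full_join on the variable "N_AIH". The catch is that this variable is called "N_AIH" in dataset 1 and "NUM_AIH" in dataset 2. This is how I joined them: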
join_test <- full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
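I have to keep both variables in the joined dataset, but now I need to identify:
1 - Obs that are in both datasets (matches)
2 - Obs that are in dataset 1 but not in dataset 2
3 - Obs that are in dataset 2 but not in dataset 1
I can't seem to find a way. I need to use the N_AIH/NUM_AIH variables.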
library(tidyverse)
dataset1 <- tibble(N_AIH = seq(5))
dataset1
#> # A tibble: 5 x 1
#> N_AIH
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
dataset2 <- tibble(NUM_AIH = seq(3))
dataset2
#> # A tibble: 3 x 1
#> NUM_AIH
#> <int>
#> 1 1
#> 2 2
#> 3 3
joined <-
  full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)
joined
#> # A tibble: 5 x 2
#> N_AIH NUM_AIH
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 NA
#> 5 5 NA
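Because keep = TRUE retains both key columns, the three groups can also be read directly off the joined table: a row that came from only one dataset has NA in the other dataset's key. The following is a minimal sketch of that idea, not the only way to do it. The column name `source` and the object `labelled` are made up for this illustration, and the approach assumes the key columns contain no NA values in the original data.

```r
library(dplyr)

dataset1 <- tibble(N_AIH = 1:5)
dataset2 <- tibble(NUM_AIH = 1:3)

joined <- full_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE)

# `source` is a hypothetical label column added for illustration.
# Assumption: neither key column has NA values before the join,
# so an NA after the join means "no match on that side".
labelled <- joined %>%
  mutate(
    source = case_when(
      !is.na(N_AIH) & !is.na(NUM_AIH) ~ "both",
      !is.na(N_AIH) &  is.na(NUM_AIH) ~ "dataset1 only",
       is.na(N_AIH) & !is.na(NUM_AIH) ~ "dataset2 only"
    )
  )

# Number of rows in each group
count(labelled, source)
```

Filtering on `source` then gives each group without re-running a join, which may matter with several million rows.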
# observations present in both datasets
inner_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"), keep = TRUE) %>%
  distinct(N_AIH, .keep_all = TRUE)
#> # A tibble: 3 x 2
#> N_AIH NUM_AIH
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
# Obs that were in dataset 1 but weren't in dataset2
anti_join(dataset1, dataset2, by = c("N_AIH" = "NUM_AIH"))
#> # A tibble: 2 x 1
#> N_AIH
#> <int>
#> 1 4
#> 2 5
# Obs that were in dataset 2 but weren't in dataset1
anti_join(dataset2, dataset1, by = c("NUM_AIH" = "N_AIH"))
#> # A tibble: 0 x 1
#> # … with 1 variable: NUM_AIH <int>
Created on 2021-10-04 by the reprex package (v2.0.1)
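If the key columns might themselves contain NA values, checking for NA after the join becomes ambiguous. One possible workaround, sketched here under that assumption, is to add a marker column to each dataset before joining. The names `in_d1`, `in_d2`, `flagged`, `matched`, `only_d1` and `only_d2` are hypothetical and chosen only for this example.

```r
library(dplyr)

dataset1 <- tibble(N_AIH = 1:5)
dataset2 <- tibble(NUM_AIH = 1:3)

# in_d1 / in_d2 are hypothetical marker columns: TRUE where the row
# originates from that dataset, NA (then FALSE) where it does not.
flagged <- full_join(
  mutate(dataset1, in_d1 = TRUE),
  mutate(dataset2, in_d2 = TRUE),
  by = c("N_AIH" = "NUM_AIH"),
  keep = TRUE
) %>%
  mutate(
    in_d1 = coalesce(in_d1, FALSE),
    in_d2 = coalesce(in_d2, FALSE)
  )

matched <- filter(flagged, in_d1, in_d2)    # in both datasets
only_d1 <- filter(flagged, in_d1, !in_d2)   # in dataset 1 only
only_d2 <- filter(flagged, !in_d1, in_d2)   # in dataset 2 only
```

With the toy data above, `matched` has 3 rows, `only_d1` has 2 rows and `only_d2` has none, which agrees with the inner_join and anti_join results.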