Finding Identical Rows in Multiple Datasets

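I am trying to find out whether 3 datasets (df1, df2, df3) have any rows in common (i.e. entire rows that are duplicated across them).

I figured out how to do this for pairs of datasets: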

df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))

df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))

df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))


library(dplyr)

inner_join(df1, df2)

inner_join(df1, df3)

inner_join(df2, df3)

The straightforward approach does not seem to work:

inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected

I thought I had found a way to do this:

library(plyr)
join_all(list(df1, df2, df3), type='inner')

But this tells me that there are no common rows (i.e. same id, same names) among these 3 data frames:

Joining by: id, names
Joining by: id, names
[1] id    names
<0 rows> (or 0-length row.names)

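This is not what I am after, since the example I created does contain rows shared by pairs of datasets: (3, peter) appears in both df1 and df2, and (2, john) appears in both df2 and df3.

I am trying to find a way to determine whether these 3 datasets share any common rows. Can this be done in R?

Thanks!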

Would this work?

dfall <- bind_rows(df1, df2, df3)
dfall[duplicated(dfall), ]
  id names
6  3 peter
8  2  john
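
One caveat: `duplicated()` on the combined rows would also flag a row that is repeated within a single data frame. A minimal sketch, assuming such within-frame repeats should not count, is to de-duplicate each data frame with `distinct()` before binding. The names `dfs` and `shared` are only illustrative.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

# Drop repeats inside each data frame so that only rows shared
# between different data frames can show up as duplicates.
dfs <- lapply(list(df1, df2, df3), distinct)

dfall <- bind_rows(dfs)

# Rows that occur more than once in the combined data, listed once each.
shared <- distinct(dfall[duplicated(dfall), ])
shared
```

With the example data this should give the same two rows as above, since none of the three data frames contains an internal repeat.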

A possible solution (if you want a data frame as the result, just pipe into bind_rows at the end):

library(dplyr)

# Run inner_join() on every pair of data frames; get() looks each one up by name
combn(paste0("df", 1:3), 2, FUN = \(x) inner_join(get(x[1]), get(x[2])), simplify = FALSE)

#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#>   id names
#> 1  3 peter
#> 
#> [[2]]
#> [1] id    names
#> <0 rows> (or 0-length row.names)
#> 
#> [[3]]
#>   id names
#> 1  2  john
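
To get a single data frame that also records which pair each row came from, one option is to keep the data frames in a named list instead of looking them up with `get()`. This is only a sketch: the names `dfs`, `pairs`, `res`, `out` and the `pair` column are assumptions made for illustration, and the `\(p)` lambda needs R 4.1 or later.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# All unordered pairs of data frame names, as a list of length-2 vectors.
pairs <- combn(names(dfs), 2, simplify = FALSE)

# Inner join each pair on both columns, so only identical rows survive.
res <- lapply(pairs, \(p) inner_join(dfs[[p[1]]], dfs[[p[2]]], by = c("id", "names")))
names(res) <- sapply(pairs, paste, collapse = "_")

# Stack the per-pair results; the list names become the 'pair' column.
out <- bind_rows(res, .id = "pair")
out
```

Pairs with no shared rows, such as df1 and df3 here, contribute nothing to the stacked result.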

You can do this with get_dupes from the janitor package.

library(tidyverse)
library(janitor)

# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1) 
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)

# Bind dataframes
# Get duplicates
df1 %>% 
  bind_rows(df2) %>% 
  bind_rows(df3) %>% 
  get_dupes(c(id, names))

#>   id names dupe_count df_id
#> 1  2  john          2     2
#> 2  2  john          2     3
#> 3  3 peter          2     1
#> 4  3 peter          2     2

Here is one way to accomplish the task:

library(dplyr)

bind_rows(df1, df2, df3) %>% 
  group_by(id, names) %>% 
  filter(n()>1) %>% 
  unique()
     id names
  <dbl> <chr>
1     3 peter
2     2 john
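
If it also matters which data frames share each row, a variation on the same idea is to label the rows while binding and then collapse the labels per group. This is a sketch under assumptions: the `source` and `found_in` column names are made up for illustration, and `distinct()` is added so that a row repeated inside one data frame is not counted twice.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

shared <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = "source") %>%
  distinct() %>%               # ignore repeats within a single data frame
  group_by(id, names) %>%
  filter(n() > 1) %>%          # keep rows seen in two or more data frames
  summarise(found_in = paste(source, collapse = ", "), .groups = "drop")

shared
```

Each remaining row then carries a `found_in` string listing the data frames it appears in.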