Finding Identical Rows in Multiple Datasets
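I am trying to find out whether 3 datasets (df1, df2, df3) have any rows in common (i.e. entire rows that are duplicated).
I figured out how to do this for pairs of 2 datasets: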
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))
library(dplyr)
inner_join(df1, df2)
inner_join(df1, df3)
inner_join(df2, df3)
- Is it possible to do this for all 3 datasets at once?
The straightforward approach does not seem to work:
inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected
I thought I had found a way to do this:
library(plyr)
join_all(list(df1, df2, df3), type='inner')
But this tells me that there are no common rows between the 3 data frames (i.e. same id, same name):
Joining by: id, names
Joining by: id, names
[1] id names
<0 rows> (or 0-length row.names)
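This is not correct, as the example I created shows:
- Row 3 of df1 and df2 is identical (id = 3, name = peter)
- Row 2 of df2 and df3 is identical (id = 2, name = john)
I am trying to find a way to determine whether these 3 datasets share any common rows. Can this be done in R?
Thanks!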
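For context, a chained inner join keeps only rows that are present in every data frame, which would explain the empty result: the two matches listed above are each shared by just two of the three data frames. The sketch below illustrates this with base R `merge()` and `Reduce()` instead of `plyr::join_all`; `df3b` is a hypothetical variant of df3 added only to show a non-empty result.

```r
df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

# Inner join on both columns, applied left to right across the list
join_all_rows <- function(dfs) {
  Reduce(function(x, y) merge(x, y, by = c("id", "names")), dfs)
}

# No row occurs in all three data frames, so this has zero rows
in_all <- join_all_rows(list(df1, df2, df3))
nrow(in_all)

# Hypothetical df3b that also contains (3, peter): now one row survives
df3b <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "peter"))
in_all_b <- join_all_rows(list(df1, df2, df3b))
in_all_b
```

So the question is really about rows shared by at least two data frames, not rows shared by all of them.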
Does this work?
dfall<-bind_rows(df1,df2,df3)
dfall[duplicated(dfall),]
id names
6 3 peter
8 2 john
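One caveat with this approach: `duplicated()` on the stacked data would also flag a row that is repeated inside a single data frame. A minimal base R sketch that guards against this, assuming that within-frame repeats should not count as shared rows, de-duplicates each data frame before stacking:

```r
df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

dfs <- list(df1, df2, df3)

# Drop repeats inside each data frame, then stack everything
stacked <- do.call(rbind, lapply(dfs, unique))

# Any row that is still duplicated must come from two different data frames
shared <- unique(stacked[duplicated(stacked), ])
rownames(shared) <- NULL
shared
```

With the example data this returns the same two rows as above, (3, peter) and (2, john).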
A possible solution (if you want a data frame as the result, just pipe into bind_rows at the end):
library(dplyr)
combn(paste0("df", 1:3), 2, simplify = FALSE, \(x) inner_join(get(x[1]), get(x[2])))
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#> id names
#> 1 3 peter
#>
#> [[2]]
#> [1] id names
#> <0 rows> (or 0-length row.names)
#>
#> [[3]]
#> id names
#> 1 2 john
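A variation on the same pairwise idea, sketched here with a named list instead of `get()` so that each result is labelled with the pair it came from. It uses base R `merge()` in place of `inner_join()`, and the `df1_df2`-style labels are just an illustrative naming choice:

```r
df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# All unordered pairs of data frame names
pairs <- combn(names(dfs), 2, simplify = FALSE)

# Inner join each pair on both columns
res <- lapply(pairs, function(p) {
  merge(dfs[[p[1]]], dfs[[p[2]]], by = c("id", "names"))
})

# Label each result with the pair it belongs to, e.g. "df1_df2"
names(res) <- vapply(pairs, paste, character(1), collapse = "_")
res
```

Passing `by` explicitly also makes it clear which columns define "the same row".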
You can do this with get_dupes from the janitor package.
library(tidyverse)
library(janitor)
# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1)
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)
# Bind the data frames, then get the duplicated rows
df1 %>%
  bind_rows(df2) %>%
  bind_rows(df3) %>%
  get_dupes(c(id, names))
#> id names dupe_count df_id
#> 1 2 john 2 2
#> 2 2 john 2 3
#> 3 3 peter 2 1
#> 4 3 peter 2 2
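If janitor is not available, the same "which data frames does each shared row come from" summary can be approximated in base R. This is only a sketch: the `source` column and the comma-separated label format are assumptions made for illustration, not part of the answer above.

```r
df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Stack the data frames, tagging every row with the name of its data frame
stacked <- do.call(rbind, Map(function(d, nm) cbind(d, source = nm), dfs, names(dfs)))

# One row per distinct (id, names), listing the data frames it occurs in
where <- aggregate(source ~ id + names, data = stacked,
                   FUN = function(s) paste(unique(s), collapse = ", "))

# Keep only rows that occur in more than one data frame
# (a comma in the label means at least two sources)
shared <- where[grepl(",", where$source, fixed = TRUE), ]
shared
```

With the example data, (3, peter) is reported for df1 and df2, and (2, john) for df2 and df3.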
Here is one way to do it:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  group_by(id, names) %>%
  filter(n()>1) %>%
  unique()
id names
<dbl> <chr>
1 3 peter
2 2 john
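The group, filter and de-duplicate steps can also be written with `dplyr::count()`, which returns one row per distinct combination together with its number of occurrences. A minimal sketch; the column name `n_rows` is an arbitrary choice, and the count equals the number of data frames only if no data frame repeats a row internally:

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), names = c("john", "alex", "peter"))
df2 <- data.frame(id = c(1, 2, 3), names = c("alex", "john", "peter"))
df3 <- data.frame(id = c(1, 2, 3), names = c("peter", "john", "tim"))

# Count how often each (id, names) combination occurs across all data frames
shared <- bind_rows(df1, df2, df3) %>%
  count(id, names, name = "n_rows") %>%
  filter(n_rows > 1)

shared
```

This keeps the occurrence count next to each shared row, which the `unique()` version discards.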