Unique case of finding duplicate values flexibly across columns in R

Question

I have a dataset similar to the following:

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))

> df
  animal_1 predation_type animal_2
1      cat           eats    mouse
2      dog           eats squirrel
3    mouse       eaten by      cat
4 squirrel           eats     nuts

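I am looking for code that identifies rows 1 and 3 as duplicates, because they describe the same phenomenon (a cat eats a mouse, or a mouse is eaten by a cat). I am not sure how to phrase what kind of duplicate case this is, so I hope someone can help. I tried merging the text into one column (i.e. "catmouse", "dogsquirrel", etc.) and then reversing the letters, but that quickly proved too complicated.

Any help is greatly appreciated.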

Answer 1

You can sort() the two animal columns within each row; after that, duplicated() is useful.

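# Keep only the two animal columns; predation_type is not part of the pair.
newdf <- df[, c("animal_1", "animal_2")]

# Sort each row's pair alphabetically so (mouse, cat) becomes (cat, mouse).
# Assumes character columns (the data.frame() default since R 4.0).
for (i in seq_len(nrow(df))) {
  newdf[i, ] <- sort(unlist(df[i, c("animal_1", "animal_2")], use.names = FALSE))
}

# Drop rows whose sorted pair has already appeared.
newdf[!duplicated(newdf), ]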

  animal_1 animal_2
1      cat    mouse
2      dog squirrel
4     nuts squirrel
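
If the loop is a concern on larger data, the same idea can probably be written without it. The following is only a sketch, assuming both animal columns are character vectors: pmin() and pmax() put each pair in alphabetical order element-wise, and the original rows (including predation_type) are kept. The names first, second, sorted_pairs and deduped are made up for illustration.

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

# Element-wise alphabetical minimum and maximum of the two columns,
# so (mouse, cat) and (cat, mouse) both become (cat, mouse).
first  <- pmin(df$animal_1, df$animal_2)
second <- pmax(df$animal_1, df$animal_2)
sorted_pairs <- data.frame(first, second)

# Keep the first occurrence of each pair, with all original columns intact.
deduped <- df[!duplicated(sorted_pairs), ]
deduped
```

Unlike the loop version, this leaves animal_1 and animal_2 as they were in the original rows; only the comparison uses the sorted pair.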

Answer 2

tidyverse

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))
library(tidyverse)

df %>% 
  rowwise() %>% 
  mutate(duplicates = str_c(sort(c_across(c(1, 3))), collapse = "")) %>% 
  group_by(duplicates) %>% 
  mutate(duplicates = n() > 1) %>% 
  ungroup()
#> # A tibble: 4 x 4
#>   animal_1 predation_type animal_2 duplicates
#>   <chr>    <chr>          <chr>    <lgl>     
#> 1 cat      eats           mouse    TRUE      
#> 2 dog      eats           squirrel FALSE     
#> 3 mouse    eaten by       cat      TRUE      
#> 4 squirrel eats           nuts     FALSE

Created on 2022-01-17 by the reprex package (v2.0.1)
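
One caveat with collapse = "": gluing the sorted names together with no separator can make different pairs look identical. The sketch below uses base R only and a hypothetical helper, pair_key(), to show the collision and one possible way around it, assuming the separator character "|" never occurs in the data.

```r
# Hypothetical helper: build an order-independent key for each (a, b) pair,
# mirroring the sort-then-collapse idea used above.
pair_key <- function(a, b, sep) {
  mapply(function(x, y) paste(sort(c(x, y)), collapse = sep),
         a, b, USE.NAMES = FALSE)
}

# Two different pairs: ("ab", "c") and ("a", "bc").
a <- c("ab", "a")
b <- c("c", "bc")

# Without a separator both keys are "abc", so the pairs look like duplicates.
no_sep <- pair_key(a, b, sep = "")

# With a separator that is absent from the data the keys stay distinct.
with_sep <- pair_key(a, b, sep = "|")

duplicated(no_sep)
duplicated(with_sep)
```

For the animal names in the question this makes no difference, but it may matter for arbitrary strings.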

Removing duplicates


library(tidyverse)
df %>% 
  filter(!duplicated(map2(animal_1, animal_2, ~str_c(sort((c(.x, .y))), collapse = ""))))
#>   animal_1 predation_type animal_2
#> 1      cat           eats    mouse
#> 2      dog           eats squirrel
#> 3 squirrel           eats     nuts

Created on 2022-01-17 by the reprex package (v2.0.1)
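
Note that matching on the sorted pair ignores predation_type, so "cat eats mouse" and "mouse eats cat" would also count as duplicates. If direction matters, one option is to rewrite every row into predator/prey form before comparing. This is only a sketch in base R, assuming predation_type takes no values other than "eats" and "eaten by"; the names swap, normalized, predator and prey are made up for illustration.

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

# Assumption: the only values are "eats" and "eaten by".
# Rows marked "eaten by" have predator and prey in reversed columns.
swap <- df$predation_type == "eaten by"

normalized <- data.frame(
  predator = ifelse(swap, df$animal_2, df$animal_1),
  prey     = ifelse(swap, df$animal_1, df$animal_2)
)

# Flag every row whose predator/prey pair occurs more than once,
# including the first occurrence (hence the fromLast = TRUE pass).
df$duplicates <- duplicated(normalized) | duplicated(normalized, fromLast = TRUE)
df
```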
