How to remove duplicated rows (e.g. dplyr::distinct()) while deleting some instances entirely (based on similarity in a specific column)?

Question

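I usually use dplyr::distinct() to remove duplicated rows from my data. This function picks one copy of each set of duplicated rows and keeps it.
However, sometimes I want to remove all copies of a row when I suspect that it is not valid.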

Example

Let's say that I survey people and ask them for their height, weight, and the country they come from.

library(dplyr)
library(tibble)

set.seed(2021)
df_1 <- data.frame(id = 1:10,
           height = sample(c(150:210), size = 10),
           weight = sample(c(80: 200), size = 10))

df_2 <- df_1

df_final <- rbind(df_1, df_2)
df_final <- dplyr::arrange(df_final, id)

df_final <-
  df_final %>%
  add_column("country" = c("uk", "uk", 
                           "france", "usa", 
                           "germany", "germany",
                           "denmark", "norway",
                           "india", "india",
                           "chine", "china",
                           "mozambique", "argentina",
                           "morroco", "morroco",
                           "sweden", "japan",
                           "italy", "italy"))


df_final
#>    id height weight    country
#> 1   1    156    189         uk
#> 2   1    156    189         uk
#> 3   2    187    148     france
#> 4   2    187    148        usa
#> 5   3    195    190    germany
#> 6   3    195    190    germany
#> 7   4    207    182    denmark
#> 8   4    207    182     norway
#> 9   5    188    184      india
#> 10  5    188    184      india
#> 11  6    161    102      chine
#> 12  6    161    102      china
#> 13  7    201    155 mozambique
#> 14  7    201    155  argentina
#> 15  8    155    130    morroco
#> 16  8    155    130    morroco
#> 17  9    209    139     sweden
#> 18  9    209    139      japan
#> 19 10    202     97      italy
#> 20 10    202     97      italy

Created on 2021-07-19 by the reprex package (v2.0.0)

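In df_final, each id denotes one person. In this example data, we have duplicates for all 10 people: each person took the survey twice. However, if we look closely, we see that some people reported coming from different countries. For example, id == 2 reported usa in one case and france in the other. In my data cleaning, I wish to remove those people.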

My primary goal is to remove duplicates. My secondary goal is to filter out those people who answered with a different country.

If I simply use dplyr::distinct(), I keep all 10 ids.

df_final %>%
  distinct(id, .keep_all = TRUE)
#>    id height weight    country
#> 1   1    156    189         uk
#> 2   2    187    148     france
#> 3   3    195    190    germany
#> 4   4    207    182    denmark
#> 5   5    188    184      india
#> 6   6    161    102      chine
#> 7   7    201    155 mozambique
#> 8   8    155    130    morroco
#> 9   9    209    139     sweden
#> 10 10    202     97      italy

How can I run distinct() but keep only those ids whose copies all have the same value for country?

Thanks!

Answer 1

Here is one option...

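df_final %>% 
   group_by(id) %>% 
   # keep only ids whose copies all report the same country
   filter(length(unique(country)) == 1) %>% 
   # collapse the remaining exact duplicates to one row per id
   distinct()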

# A tibble: 5 x 4
# Groups:   id [5]
     id height weight country
  <int>  <int>  <int> <chr>  
1     1    177     83 uk     
2     3    191    151 germany
3     5    186    175 india  
4     8    164    178 morroco
5    10    201    141 italy
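
The filter step keeps only the groups in which country takes a single distinct value, and distinct() then collapses the remaining exact duplicates. As a minimal sketch of the same idea, the block below uses a small hand-made data frame (`toy` and its values are made up for illustration and are not the question's data), swaps in n_distinct() for length(unique()), and drops the grouping at the end so the result is an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data: id 2 reports two different countries
toy <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  height  = c(170, 170, 180, 180, 160, 160),
  country = c("uk", "uk", "france", "usa", "india", "india")
)

result <- toy %>%
  group_by(id) %>%
  filter(n_distinct(country) == 1) %>%  # keep ids with one consistent country
  ungroup() %>%                         # remove the grouping before de-duplicating
  distinct()                            # collapse exact duplicate rows

result
```

With this toy data, id 2 is dropped entirely and ids 1 and 3 are each reduced to a single row.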

Answer 2

We could also do

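library(dplyr)
df_final %>%
    # one row per (id, country) pair; consistent ids now appear exactly once
    distinct(id, country, .keep_all = TRUE) %>%
    # keep only the ids that appear once after de-duplication
    filter(id %in% names(which(table(id) == 1)))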
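
This works because, after distinct(id, country, .keep_all = TRUE), an id with a consistent country appears exactly once, while an id with conflicting countries appears once per country. Note that names(table(id)) is a character vector, so the %in% comparison relies on id being coerced to character. The sketch below is one possible variant that avoids that coercion by counting rows with add_count(); the `toy` data is made up for illustration and is not the question's data.

```r
library(dplyr)

# Hypothetical toy data: id 2 reports two different countries
toy <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  height  = c(170, 170, 180, 180, 160, 160),
  country = c("uk", "uk", "france", "usa", "india", "india")
)

result2 <- toy %>%
  distinct(id, country, .keep_all = TRUE) %>%  # one row per (id, country) pair
  add_count(id) %>%                            # n = rows left for each id
  filter(n == 1) %>%                           # keep ids with a single country
  select(-n)                                   # drop the helper column

result2
```

Here id 2 still has two rows after distinct() (france and usa), so it is filtered out, and ids 1 and 3 remain with one row each.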
