Remove duplicates based on conditions in rows in a dataframe

Question

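I have a dataframe with many duplicated names; a reproducible example is below.
I am trying to clean the dataset by removing the rows that have a duplicated name and the least information.
I added a column in which I compute the % of NA among the cells of each row; in my example I called it %_Scoring.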

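Among the rows that share a name, I want to keep the row with the lowest %_Scoring (the % of NA).
N.B.: if %_Scoring is equal, it does not matter which one is kept, but one of the two rows still has to be removed.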

data_people <- "https://raw.githubusercontent.com/max9nc9/Temp/main/data_people.csv"
data_people <- read.csv(data_people, sep = ",")

In the data example above, I would keep only 2 rows:

The first is the row for Margarita Pan
The second is the row for John Doe where %_Scoring = 0.56
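
A note on the column name, since the answers below refer to X._Scoring rather than %_Scoring: read.csv uses check.names = TRUE by default, which passes the header through make.names, and that turns %_Scoring into X._Scoring. The sketch below shows this, along with one plausible way to compute a per-row NA share with rowMeans(is.na(...)). The toy data frame and its City column are hypothetical and are not the CSV from the question; how the original column was actually computed is not stated.

```r
# Hypothetical toy data; NOT the CSV linked in the question.
toy <- data.frame(
  Name   = c("A", "A", "B"),
  Height = c(1.80, NA, 1.65),
  City   = c(NA, NA, "Oslo"),
  stringsAsFactors = FALSE
)

# Share of NA cells per row (an assumed definition of the score).
na_share <- rowMeans(is.na(toy))
toy[["%_Scoring"]] <- na_share

# What read.csv(check.names = TRUE) would do to that header.
clean_name <- make.names("%_Scoring")
print(clean_name)  # "X._Scoring"
```

Reading the file with check.names = FALSE would keep the original header, at the cost of having to quote the column name with backticks in later code.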

Answer 1

Use slice_max after grouping by 'Name' (this keeps the row with the highest X._Scoring in each group)

library(dplyr)
data_people %>% 
    group_by(Name) %>%
    slice_max(n = 1, order_by = X._Scoring) %>%
    ungroup

-output

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

Or, if we want to keep the minimum instead, as the question asks, use slice_min

data_people %>% 
    group_by(Name) %>%
    slice_min(n = 1, order_by = X._Scoring) %>%
    ungroup
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
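
The question also asks that exactly one row survive when two rows of the same name have an equal score. By default slice_min and slice_max use with_ties = TRUE and return every tied row, so a tie would leave both duplicates in place. A minimal sketch of the difference, using a hypothetical toy data frame (not the question's CSV) with a deliberate tie in group "A":

```r
library(dplyr)

# Hypothetical toy data with a tie inside group "A".
toy <- data.frame(
  Name       = c("A", "A", "B"),
  Height     = c(1.80, NA, 1.65),
  X._Scoring = c(0.5, 0.5, 0.2),
  stringsAsFactors = FALSE
)

# Default behaviour: both tied rows of "A" are returned.
tied <- toy %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1) %>%
  ungroup()

# with_ties = FALSE: at most n rows per group, so one row per name.
single <- toy %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(tied)    # 3
nrow(single)  # 2
```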

Answer 2

library(dplyr)
data_people %>% 
    arrange(X._Scoring) %>%          # lowest share of NA first
    group_by(Name) %>% 
    filter(!duplicated(Name)) %>%    # keep the first (lowest-scoring) row per name
    ungroup

Output

  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
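
One detail worth checking with this sort-then-keep-first pattern: if the logical condition is combined with a number via &, for example !duplicated(Name) & min(X._Scoring), the number is coerced to logical, and a group whose minimum score is exactly 0 evaluates to FALSE and disappears entirely. The sketch below illustrates this on a hypothetical toy data frame (not the question's CSV) in which group "B" has a row with a score of 0.

```r
library(dplyr)

# Hypothetical toy data; group "B" contains a row with score 0.
toy <- data.frame(
  Name       = c("A", "A", "B", "B"),
  X._Scoring = c(0.5, 0.25, 0, 0.75),
  stringsAsFactors = FALSE
)

# Numeric min() inside `&` is coerced to logical: 0 becomes FALSE,
# so every row of group "B" is filtered out.
coerced <- toy %>%
  group_by(Name) %>%
  arrange(X._Scoring) %>%
  filter(!duplicated(Name) & min(X._Scoring)) %>%
  ungroup()

# Keeping only the first row per group after sorting avoids the coercion.
first_only <- toy %>%
  arrange(X._Scoring) %>%
  group_by(Name) %>%
  filter(row_number() == 1) %>%
  ungroup()

nrow(coerced)     # 1, only "A" survives
nrow(first_only)  # 2, one row per name
```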

Answer 3

A base R option with duplicated + ave

subset(
  data_people,
  !duplicated(Name) & ave(rowSums(!is.na(data_people)), Name, FUN = function(x) x == max(x))
)

gives

           Name                    Information Height X._Scoring
1      John Doe         This is an information   1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78
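
Note that the condition above keeps a row only when it is both the first occurrence of its name and the row with the most non-NA cells, and it counts NA cells directly instead of using X._Scoring. When the most complete row of a name is not its first occurrence, that name is dropped altogether. A possible base R alternative, sketched here on a hypothetical toy data frame (not the question's CSV), is to sort by X._Scoring first and then keep the first occurrence of each name:

```r
# Hypothetical toy data; the first "A" row is the less complete one.
toy <- data.frame(
  Name       = c("A", "A", "B"),
  Height     = c(NA, 1.80, 1.65),
  X._Scoring = c(0.5, 0.25, 0),
  stringsAsFactors = FALSE
)

# duplicated + ave on this data: "A" is lost, because its first
# occurrence is not the row with the most non-NA cells.
via_ave <- subset(
  toy,
  !duplicated(Name) & ave(rowSums(!is.na(toy)), Name, FUN = function(x) x == max(x))
)

# Sort by score, then keep the first occurrence of each name.
sorted  <- toy[order(toy$X._Scoring), ]
deduped <- sorted[!duplicated(sorted$Name), ]

nrow(via_ave)  # 1
nrow(deduped)  # 2
```

Because order is stable for ties, rows with an equal score keep their original relative order, and only the first of them is retained.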
