Remove duplicates based on conditions in rows in a dataframe

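I have a data frame with many duplicated names; a reproducible example is below.
I am trying to clean the dataset by removing the rows that have a duplicated name and the least information.
I added a column holding the percentage of NA cells in each row; in my example it is called %_Scoring.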

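Among the rows that share a name, I want to keep the one with the lowest %_Scoring (percentage of NA).
N.B.: if the %_Scoring values are equal, it does not matter which row stays, but one of the two rows should still be removed.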
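For readers who want to reproduce the scoring column, here is a minimal sketch of one way to compute a per-row NA share in base R. The toy data frame `people` and the column name `na_share` are hypothetical and are not taken from the CSV in the question; the question's own column may have been computed differently.

```r
# Toy data (hypothetical, not the asker's CSV): one duplicated name.
people <- data.frame(
  Name        = c("A", "A", "B"),
  Information = c("x", NA, "y"),
  Height      = c(1.8, NA, 1.5),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; rowMeans() averages TRUE/FALSE as 1/0,
# which yields the share of NA cells in each row.
people$na_share <- rowMeans(is.na(people))
```

A lower `na_share` means a more complete row, which is the row the question wants to keep.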

data_people <- "https://raw.githubusercontent.com/max9nc9/Temp/main/data_people.csv"
data_people <- read.csv(data_people, sep = ",")

In the data example above, I would keep only 2 rows:

Use slice_max after grouping by 'Name'. Note that read.csv turns the %_Scoring header into X._Scoring.

library(dplyr)
data_people %>% 
    group_by(Name) %>%
    slice_max(n = 1, order_by = X._Scoring) %>%
    ungroup

-output

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

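Or, if we want to keep the minimum (the lowest share of NA, as the question asks), use slice_min. Setting with_ties = FALSE keeps a single row per name even when the scores tie.

data_people %>% 
    group_by(Name) %>%
    slice_min(n = 1, order_by = X._Scoring, with_ties = FALSE) %>%
    ungroup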
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78
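
Another dplyr option: sort by the score, then keep the first row for each name.

library(dplyr)
data_people %>% 
    group_by(Name) %>% 
    arrange(X._Scoring) %>%        # lowest NA share first
    filter(!duplicated(Name))      # the first row per Name is the lowest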

Output

  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA          0.56
2 Margarita Pan This is an information as well   1.47       0.78

A base R option using duplicated + ave

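# Flag the rows with the most non-NA cells within each Name (ave returns 0/1)
most_complete <- ave(rowSums(!is.na(data_people)), data_people$Name,
                     FUN = function(x) x == max(x)) == 1
best <- data_people[most_complete, ]
# Drop any remaining ties so that each Name appears once
best[!duplicated(best$Name), ]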

which gives

           Name                    Information Height X._Scoring
1      John Doe         This is an information   1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78
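
If the precomputed score column is the criterion, as the question asks, one base R sketch is to sort by that column and keep the first row per name. The toy data frame `people` and the column `Scoring` below are hypothetical stand-ins for `data_people` and `X._Scoring`; the tie for "B" is there to show that only one of the tied rows survives.

```r
# Toy data (hypothetical): "A" has two different scores, "B" has a tie.
people <- data.frame(
  Name    = c("A", "A", "B", "B"),
  Scoring = c(0.9, 0.5, 0.7, 0.7),
  stringsAsFactors = FALSE
)

# Sort so the lowest NA share comes first,
# then keep only the first occurrence of each Name.
sorted  <- people[order(people$Scoring), ]
deduped <- sorted[!duplicated(sorted$Name), ]
```

Because `duplicated()` marks every occurrence after the first, sorting beforehand decides which row is kept, and tied rows are resolved by their original order.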