Remove duplicates based on conditions in rows in a dataframe
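I have a data frame with many duplicated names; a reproducible example is below.
I am trying to clean the dataset by removing the rows that have a duplicated name and the least information.
I added a column, called %_Scoring in my example, that holds the percentage of NA cells in each row.
Among the rows that share a name, I want to keep the row with the lowest %_Scoring (percentage of NA).
N.B.: if the %_Scoring values are equal, it doesn't matter which row is kept, but one of the two still has to be removed.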
data_people <- "https://raw.githubusercontent.com/max9nc9/Temp/main/data_people.csv"
data_people <- read.csv(data_people, sep = ",")
In the example data above, I would keep only 2 rows:
- the first one is Margarita Pan
- the second one is John Doe, with %_Scoring = 0.56
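After grouping by 'Name', use slice_max; this keeps the row with the highest score in each group. (read.csv renames the %_Scoring column to X._Scoring, so that is the name used below.)
library(dplyr)
data_people %>%
  group_by(Name) %>%
  slice_max(n = 1, order_by = X._Scoring) %>%
  ungroup()
-output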
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78
Or, if we want to keep the minimum (the lowest percentage of NA, as the question asks), use slice_min
data_people %>%
  group_by(Name) %>%
  # with_ties = FALSE returns a single row per name even when scores are tied
  slice_min(n = 1, order_by = X._Scoring, with_ties = FALSE) %>%
  ungroup()
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information             NA       0.56
2 Margarita Pan This is an information as well   1.47       0.78
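On the tie case mentioned in the question: slice_min() keeps every tied row by default (with_ties = TRUE), so a name whose rows share the same score would still appear more than once. The sketch below illustrates the difference on a small made-up data frame; `people` and its values are hypothetical and are not the asker's CSV.

```r
library(dplyr)

# Hypothetical data (not the asker's CSV): Ann Lee has two rows tied at 0.25
people <- data.frame(
  Name = c("Ann Lee", "Ann Lee", "Bob Ray", "Bob Ray"),
  Height = c(1.60, 1.60, NA, 1.75),
  X._Scoring = c(0.25, 0.25, 0.50, 0.10)
)

# Default (with_ties = TRUE): both tied Ann Lee rows survive
with_ties_kept <- people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per name
one_per_name <- people %>%
  group_by(Name) %>%
  slice_min(order_by = X._Scoring, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_kept)  # 3
nrow(one_per_name)    # 2
```

Which of the tied rows is returned is not something to rely on; that matches the question, which only requires that one of them be dropped.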
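Another dplyr option: sort by X._Scoring and keep the first row for each name.
library(dplyr)
data_people %>%
  group_by(Name) %>%
  # sort each group so the row with the lowest X._Scoring comes first
  arrange(X._Scoring, .by_group = TRUE) %>%
  # keep only the first row per name
  filter(!duplicated(Name)) %>%
  ungroup()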
Output
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information             NA       0.56
2 Margarita Pan This is an information as well   1.47       0.78
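A base R option with duplicated + ave. Note that it compares rows by their number of non-NA cells rather than by X._Scoring, and it keeps a name only if that name's first row has the maximum count.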
subset(
data_people,
!duplicated(Name) & ave(rowSums(!is.na(data_people)), Name, FUN = function(x) x == max(x))
)
which gives
           Name                    Information Height X._Scoring
1      John Doe         This is an information   1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78
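If the goal is to follow the question literally and keep the row with the lowest X._Scoring, a base R variant could sort first and then drop later duplicates. This is only a sketch under assumptions: `people` is a made-up data frame, not the asker's CSV, and the column names simply mirror the ones above.

```r
# Hypothetical data (not the asker's CSV)
people <- data.frame(
  Name = c("Ann Lee", "Bob Ray", "Ann Lee", "Bob Ray"),
  Height = c(NA, 1.75, 1.60, NA),
  X._Scoring = c(0.50, 0.10, 0.25, 0.40)
)

# Sort so the lowest X._Scoring comes first within each name
sorted <- people[order(people$Name, people$X._Scoring), ]

# Keep the first row per name; ties resolve to whichever row sorts first
deduped <- sorted[!duplicated(sorted$Name), ]
deduped
```

Because the sort happens before duplicated() is applied, every name keeps exactly one row, including when scores are tied.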