R: Remove duplicate rows based on certain criteria
I want to remove duplicates based on certain criteria.
My data looks like this:
Animal<-c("bird","Bird ","Dog","Cat F","Lion","Lion","Lion","dog","Horse","cat", "Lion")
A_date<-c("02-08-2020","20-06-2018","01-01-2015","10-07-2021","20-06-2018","15-08-2019","05-08-2013","20-06-2010","15-11-2016","22-03-2022","15-05-2019")
ID<-c("T1", "T1","T1","T2","T2","T3","T3","T4","T4","T5","T5")
Mydata<-data.frame(Animal, A_date, ID)
Animal A_date ID
bird 02-08-2020 T1
Bird 20-06-2018 T1
Dog 01-01-2015 T1
Cat F 10-07-2021 T2
Lion 20-06-2018 T2
Lion 15-08-2019 T3
lion 05-08-2013 T3
dog 20-06-2010 T4
Horse 15-11-2016 T4
cat 22-03-2022 T5
Lion 15-05-2019 T5
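I want to remove duplicated rows so that only the row with the latest date per ID remains. For example, in the table above Lion appears 3 times with the same ID, so I want to keep only Lion 15-08-2019 T3, but I also want to keep the Lion with ID T5.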
The final result should look like this:
Animal A_date ID
Dog 01-01-2015 T1
bird 02-08-2020 T1
Dog 01-01-2015 T1
Cat F 10-07-2021 T2
Lion 15-08-2019 T3
dog 20-06-2010 T4
Horse 15-11-2016 T4
cat 22-03-2022 T5
Lion 15-05-2019 T5
The data I am working with is very large, with IDs ranging from T1 to T20.
I have tried the following code, but it does not work properly:
library(lubridate)
library(dplyr)
Mydata <- Mydata %>%
  mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
  arrange(A_date)
Mydata %>%
  filter(!duplicated(Animal, fromLast = TRUE))
This is the result I got:
Animal A_date ID
DOG <NA> T1
HORSE <NA> T4
BIRD <NA> T1
LION <NA> T3
BIRD <NA> T1
CAT F <NA> T2
CAT <NA> T5
This is not the final result I want.
One option is to group by ID and Animal, then arrange the rows so that the most recent date in each group comes first, and then slice that first row.
library(lubridate)
library(dplyr)
Mydata %>%
  mutate(Animal = trimws(toupper(Animal)), A_date = lubridate::dmy(A_date)) %>%
  group_by(ID, Animal) %>%
  arrange(ID, Animal, desc(A_date)) %>%
  slice(1)
Output
Animal A_date ID
<chr> <date> <chr>
1 BIRD 2020-08-02 T1
2 DOG 2015-01-01 T1
3 CAT F 2021-07-10 T2
4 LION 2018-06-20 T2
5 LION 2019-08-15 T3
6 DOG 2010-06-20 T4
7 HORSE 2016-11-15 T4
8 CAT 2022-03-22 T5
9 LION 2019-05-15 T5
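For comparison, the same "latest row per ID and Animal" idea can be sketched in base R, which is closer to the duplicated() attempt in the question. This is only a sketch: it assumes the third column of the data frame is meant to be ID, and the names ord and latest are made up for illustration.

```r
# Sample data from the question (assumption: the third column is ID)
Animal <- c("bird", "Bird ", "Dog", "Cat F", "Lion", "Lion", "Lion",
            "dog", "Horse", "cat", "Lion")
A_date <- c("02-08-2020", "20-06-2018", "01-01-2015", "10-07-2021",
            "20-06-2018", "15-08-2019", "05-08-2013", "20-06-2010",
            "15-11-2016", "22-03-2022", "15-05-2019")
ID <- c("T1", "T1", "T1", "T2", "T2", "T3", "T3", "T4", "T4", "T5", "T5")
Mydata <- data.frame(Animal, A_date, ID)

# Normalise the names (case and stray whitespace) and parse day-month-year dates
Mydata$Animal <- trimws(toupper(Mydata$Animal))
Mydata$A_date <- as.Date(Mydata$A_date, format = "%d-%m-%Y")

# Sort by ID and Animal, newest date first within each pair
ord <- Mydata[order(Mydata$ID, Mydata$Animal, -as.numeric(Mydata$A_date)), ]

# Keep the first (newest) row of every ID/Animal pair
latest <- ord[!duplicated(ord[c("ID", "Animal")]), ]
print(latest)
```

The key difference from the attempt in the question is that duplicated() is applied to both ID and Animal, so the same animal under a different ID is not treated as a duplicate.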
Alternatively, we could use slice_max on A_date:
Mydata %>%
  mutate(Animal = toupper(Animal), A_date = lubridate::dmy(A_date)) %>%
  group_by(ID, Animal) %>%
  slice_max(A_date) %>%
  ungroup()
This gives
# A tibble: 10 x 3
Animal A_date ID
<chr> <date> <chr>
1 "BIRD" 2020-08-02 T1
2 "BIRD " 2018-06-20 T1
3 "DOG" 2015-01-01 T1
4 "CAT F" 2021-07-10 T2
5 "LION" 2018-06-20 T2
6 "LION" 2019-08-15 T3
7 "DOG" 2010-06-20 T4
8 "HORSE" 2016-11-15 T4
9 "CAT" 2022-03-22 T5
10 "LION" 2019-05-15 T5
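Note that "BIRD " (with a trailing space) is kept as its own group in the output above, which is why there are 10 rows instead of 9. A possible variant, sketched here under the assumption that the trailing space is just a data-entry artefact, trims the names before grouping. The with_ties = FALSE argument is an extra choice not in the original answer; it returns a single row per group even if two rows share the latest date. The name latest_max is hypothetical.

```r
library(dplyr)
library(lubridate)

# Sample data from the question (assumption: the third column is ID)
Animal <- c("bird", "Bird ", "Dog", "Cat F", "Lion", "Lion", "Lion",
            "dog", "Horse", "cat", "Lion")
A_date <- c("02-08-2020", "20-06-2018", "01-01-2015", "10-07-2021",
            "20-06-2018", "15-08-2019", "05-08-2013", "20-06-2010",
            "15-11-2016", "22-03-2022", "15-05-2019")
ID <- c("T1", "T1", "T1", "T2", "T2", "T3", "T3", "T4", "T4", "T5", "T5")
Mydata <- data.frame(Animal, A_date, ID)

latest_max <- Mydata %>%
  # trimws() merges "BIRD " with "BIRD" before the rows are grouped
  mutate(Animal = trimws(toupper(Animal)), A_date = dmy(A_date)) %>%
  group_by(ID, Animal) %>%
  # one row per group: the one with the latest date, ties broken arbitrarily
  slice_max(A_date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(latest_max)
```

With the sample data this should leave a single BIRD row for T1, matching the row count of the arrange/slice approach.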