Delete duplicates based on multiple columns, but keep the "most" complete version of each duplicate (the one with the fewest NAs)
I have a data frame that looks like this:
Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Jan    Tue  2020  Blue   Warm     NA        NA             1
Jan    Tue  2020  Blue   NA       NA        NA             1
Feb    Thu  2020  Red    NA       NA        NA             2
Feb    Thu  2020  Red    Warm     NA        NA             2
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3
I would like it to look like this:
Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3
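Basically, I want to decide whether rows are duplicates by looking at three columns, c(ID, Month, Color). Once the duplicates are identified, I want to drop the ones with the most NAs, i.e. the "least complete" ones, because they have fewer columns filled in.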
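Maybe this works: I used rowSums(is.na()) to count the missing items in each row, then grouped by ID, Month, and Color and filtered to the rows with the fewest missing values: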
library(dplyr)

dat <- data.frame(
  "Month" = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  "Day" = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  "Year" = rep(2020, 7),
  "Color" = c(rep("Blue", 3), rep("Red", 4)),
  "Weather" = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  "Location" = c("Hospital", rep(NA, 4), "Garden", "Desk"),
  "Transporation" = c(rep(NA, 5), "Run", "Bus"),
  "ID" = c(1, 1, 1, 2, 2, 2, 3)
) %>%
  mutate(Missing = rowSums(is.na(.))) %>%  # count the missing items in each row
  group_by(ID, Month, Color) %>%
  filter(Missing == min(Missing)) %>%      # keep the rows with the fewest missing items
  ungroup() %>%
  select(-Missing)                         # drop the helper column; it was only used to filter
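One caveat with the filter() step: if two rows in the same group tie for the fewest NAs, both are kept, so the group is not fully deduplicated. A minimal sketch of one way to guarantee a single row per group, assuming dplyr 1.0.0 or later for slice_min(); the small `raw` data frame and the name `deduped` are made up for illustration and are not part of the question's data:

```r
library(dplyr)

# Hypothetical data: the two ID 1 rows each have exactly one NA (a tie)
raw <- data.frame(
  ID       = c(1, 1, 2),
  Month    = c("Jan", "Jan", "Feb"),
  Color    = c("Blue", "Blue", "Red"),
  Weather  = c("Warm", NA, "Cold"),
  Location = c(NA, "Hospital", "Desk")
)

deduped <- raw %>%
  mutate(Missing = rowSums(is.na(.))) %>%           # NAs per row
  group_by(ID, Month, Color) %>%
  slice_min(Missing, n = 1, with_ties = FALSE) %>%  # exactly one row per group
  ungroup() %>%
  select(-Missing)

nrow(deduped)  # one row per (ID, Month, Color) group
```

With with_ties = FALSE the choice among tied rows is arbitrary in the sense that it simply follows row order, so whether that is acceptable depends on your data.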
We can use order to select the first non-NA element of each column after grouping by the columns of interest:
library(dplyr)

dat %>%
  group_by(Month, Day, Year) %>%
  summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = 'drop')
Output:
# A tibble: 3 x 8
  Month Day    Year Color Weather Location Transporation    ID
  <chr> <chr> <dbl> <chr> <chr>   <chr>    <chr>         <dbl>
1 Feb   Thu    2020 Red   Warm    Garden   Run               2
2 Jan   Tue    2020 Blue  Warm    Hospital <NA>              1
3 Mar   Thu    2020 Red   Cold    Desk     Bus               3
Data:
dat <- structure(list(Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb",
"Mar"), Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"
), Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020), Color = c("Blue",
"Blue", "Blue", "Red", "Red", "Red", "Red"), Weather = c("Warm",
"Warm", NA, NA, "Warm", "Warm", "Cold"), Location = c("Hospital",
NA, NA, NA, NA, "Garden", "Desk"), Transporation = c(NA, NA,
NA, NA, NA, "Run", "Bus"), ID = c(1, 1, 1, 2, 2, 2, 3)), class = "data.frame", row.names = c(NA,
-7L))
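Note that taking the first non-NA element column by column is not quite the same as keeping the most complete row: values from different rows of a group can end up combined into one row that never existed in the input. A small sketch of that behaviour, using a made-up two-row data frame (`mixed` and `combined` are hypothetical names) and grouping by the question's key columns:

```r
library(dplyr)

# Hypothetical group where each row fills in a different column
mixed <- data.frame(
  ID       = c(1, 1),
  Month    = c("Jan", "Jan"),
  Color    = c("Blue", "Blue"),
  Weather  = c("Warm", NA),
  Location = c(NA, "Hospital")
)

combined <- mixed %>%
  group_by(ID, Month, Color) %>%
  # for each column, put non-NA values first and take the first one
  summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = "drop")

combined  # Weather comes from row 1, Location from row 2
```

For the question's data this makes no difference, because the most complete row in each group already contains every non-NA value of that group. If rows must be kept intact, a row-wise NA count is the safer route.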
Using the data.table library, assuming your data is already in j:
library(data.table)
j <- as.data.table(your_data)
j
    Month    Day  Year  Color Weather Location Transporation    ID
   <char> <char> <int> <char>  <char>   <char>        <char> <int>
1:    Jan    Tue  2020   Blue    Warm Hospital          <NA>     1
2:    Jan    Tue  2020   Blue    Warm     <NA>          <NA>     1
3:    Jan    Tue  2020   Blue    <NA>     <NA>          <NA>     1
4:    Feb    Thu  2020    Red    <NA>     <NA>          <NA>     2
5:    Feb    Thu  2020    Red    Warm     <NA>          <NA>     2
6:    Feb    Thu  2020    Red    Warm   Garden           Run     2
7:    Mar    Thu  2020    Red    Cold     Desk           Bus     3
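# count the NAs in each row
j$n_na <- apply(j, MARGIN = 1, function(x) sum(is.na(x)))
# sort so that the most complete rows come first
setorder(j, n_na)
# unique() keeps the first row of each (ID, Month, Color) group
k <- unique(j, by = c("ID", "Month", "Color"))
setorder(k, ID)
k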
    Month    Day  Year  Color Weather Location Transporation    ID  n_na
   <char> <char> <int> <char>  <char>   <char>        <char> <int> <int>
1:    Jan    Tue  2020   Blue    Warm Hospital          <NA>     1     1
2:    Feb    Thu  2020    Red    Warm   Garden           Run     2     0
3:    Mar    Thu  2020    Red    Cold     Desk           Bus     3     0
In the end, k holds the data as you requested.
Regards, Miguel
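A possible variation on the data.table approach, offered as a sketch rather than a correction: the row-wise NA count can be computed with rowSums() instead of apply(), and the first row per group can be taken directly in a grouped query. The three-row table `j` below is made up for illustration and is not the question's data.

```r
library(data.table)

# Hypothetical data: ID 1 has two rows with different numbers of NAs
j <- data.table(
  ID       = c(1L, 1L, 2L),
  Month    = c("Jan", "Jan", "Feb"),
  Color    = c("Blue", "Blue", "Red"),
  Weather  = c(NA, "Warm", "Cold"),
  Location = c(NA, NA, "Desk")
)

# vectorised NA count per row, added by reference
j[, n_na := rowSums(is.na(.SD))]

# order by completeness, then keep the first row of each group
k <- j[order(n_na), .SD[1L], by = .(ID, Month, Color)]
k[, n_na := NULL]  # drop the helper column
k
```

Unlike setorder(), order() inside `[` does not reorder j itself, which may matter if the original row order is needed later.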