Of two rows eliminate the one with more NAs
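I am looking for a way to check whether two columns in a data frame contain the same elements in two or more rows, and then eliminate the row that contains more NAs.
Let's assume we have a data frame like this: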
x <- data.frame("Year" = c(2017,2017,2017,2018,2018),
                "Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                "Sales" = c(15, 15, 18, 13, 12),
                "Campaigns" = c(3, NA, 4, 1, 1),
                "Employees" = c(15, 15, 12, 8, 9),
                "Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
                "Expenses" = c(NA, NA, 9000, 7500, 4300))
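Note that the entry for Sweden in 2017 appears twice: the first row has one NA, while the second has NAs in three places. I would like to check whether two rows share the same "Year" and "Country" and, if so, eliminate the row that contains more NAs, in this case the second row. I did some research, but I could not find a solution for this particular case.
Thank you very much.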
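We can use a data.table approach:
library(data.table)
# setDT() converts x to a data.table by reference
ind <- setDT(x)[, {
  i1 <- Reduce(`+`, lapply(.SD, is.na))   # NA count for each row of the group
  .I[i1 > 0 & (i1 == max(i1))]            # indices of rows that have NAs and reach the group's maximum
}, .(Year, Country)]$V1
x[-ind]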
# Year Country Sales Campaigns Employees Satisfaction Expenses
#1: 2017 Sweden 15 3 15 0.80 NA
#2: 2017 Norway 18 4 12 0.90 9000
#3: 2018 Denmark 13 1 8 0.95 7500
#4: 2018 Finland 12 1 9 0.87 4300
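One caveat worth checking: as far as I can tell, the rule `i1 > 0 & (i1 == max(i1))` flags the row(s) with the most NAs in each group whether or not the group actually contains a duplicate. The sketch below re-expresses that rule in base R. It is not the answer's code: `flag_rows` and `key` are names made up for illustration, and it assumes the Year and Country columns themselves contain no NAs. It applies the rule to the example data with and without the second Sweden row.

```r
x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

# Hypothetical helper: TRUE for every row the rule would drop.
flag_rows <- function(df) {
  n_na <- unname(rowSums(is.na(df)))            # NAs per row
  key <- paste(df$Year, df$Country)             # one key per Year/Country group
  n_na > 0 & n_na == ave(n_na, key, FUN = max)  # has NAs and ties the group maximum
}

flag_rows(x)        # only the second Sweden row is flagged
flag_rows(x[-2, ])  # the single remaining Sweden row is flagged as well
```

If groups without duplicates can contain NAs in your data, keeping the row with the fewest NAs per group (as the dplyr answer does) avoids dropping them.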
Using dplyr:
library(dplyr)
x %>%
  mutate(n_na = rowSums(is.na(.))) %>% ## calculate NAs for each row
  group_by(Year, Country) %>%          ## for each year/country
  arrange(n_na) %>%                    ## sort by number of NAs
  slice(1) %>%                         ## take the first row
  select(-n_na)                        ## remove the NA counter column
# A tibble: 4 x 7
# Groups: Year, Country [4]
#    Year Country Sales Campaigns Employees Satisfaction Expenses
#   <dbl>  <fctr> <dbl>     <dbl>     <dbl>        <dbl>    <dbl>
# 1  2017  Norway    18         4        12         0.90     9000
# 2  2017  Sweden    15         3        15         0.80       NA
# 3  2018 Denmark    13         1         8         0.95     7500
# 4  2018 Finland    12         1         9         0.87     4300
A base R solution:
# written for a plain data.frame; recreate x first if setDT(x) was run on it
x$nas <- rowSums(sapply(x, is.na))
do.call(rbind,
        by(x, x[c("Year","Country")],
           function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
# Year Country Sales Campaigns Employees Satisfaction Expenses nas
# 4 2018 Denmark 13 1 8 0.95 7500 0
# 5 2018 Finland 12 1 9 0.87 4300 0
# 3 2017 Norway 18 4 12 0.90 9000 0
# 1 2017 Sweden 15 3 15 0.80 NA 1
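A shorter base R variant is sketched here. It assumes that ties (equal NA counts within a group) should be resolved by keeping the row that appears first. It sorts the row indices by NA count, drops repeated Year/Country keys with `duplicated()`, then restores the original row order. The names `n_na`, `keep` and `result` are illustrative, and `x` is assumed to be a plain data.frame.

```r
x <- data.frame(Year = c(2017, 2017, 2017, 2018, 2018),
                Country = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
                Sales = c(15, 15, 18, 13, 12),
                Campaigns = c(3, NA, 4, 1, 1),
                Employees = c(15, 15, 12, 8, 9),
                Satisfaction = c(0.8, NA, 0.9, 0.95, 0.87),
                Expenses = c(NA, NA, 9000, 7500, 4300))

n_na <- rowSums(is.na(x))   # NAs per row
keep <- order(n_na)         # row indices, fewest NAs first; ties stay in original order
# keep only the first (fewest-NA) row of each Year/Country pair
keep <- keep[!duplicated(x[keep, c("Year", "Country")])]
result <- x[sort(keep), ]   # back to the original row order
result
```

Unlike the `by()` version, this adds no helper column to `x` and leaves the surviving rows in their original order.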
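Not surprisingly, the data.table implementation is the fastest, though I am a little surprised by how much faster it is than base R. The small size of the dataset may be affecting this. (In the benchmark I had to make a copy of the original, since data.table modifies the data in place, so x would no longer be a data.frame.)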
library(microbenchmark)
# data.table and dplyr are assumed to be loaded, as above
microbenchmark(
  data.table = {
    x0 <- copy(x)
    ind <- setDT(x0)[, {
      i1 <- Reduce(`+`, lapply(.SD, is.na))
      .I[i1 > 0 & (i1 == max(i1))]
    }, .(Year, Country)]$V1
    x0[-ind]
  },
  dplyr = {
    x %>%
      mutate(n_na = rowSums(is.na(.))) %>% ## calculate NAs for each row
      group_by(Year, Country) %>%          ## for each year/country
      arrange(n_na) %>%                    ## sort by number of NAs
      slice(1) %>%                         ## take the first row
      select(-n_na)                        ## remove the NA counter column
  },
  base = {
    x0 <- x
    x0$nas <- rowSums(sapply(x0, is.na))
    do.call(rbind,
            by(x0, x0[c("Year","Country")],
               function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
  }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.223477 1.441005 1.973714 1.582861 1.919090 12.837569 100
# dplyr 2.675239 2.901882 4.465172 3.079295 3.806453 42.261540 100
# base 2.039615 2.209187 2.737758 2.298714 2.570760 8.586946 100
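Since five rows are too few to say much about scaling, the comparison could be repeated on a larger input. The following is only a sketch of how such data might be generated. `make_data`, its parameters and the value ranges are invented for illustration, and no timings are claimed here.

```r
# Hypothetical generator: n rows spread over a limited set of Year/Country
# keys, with NAs sprinkled into the value columns at rate na_rate.
make_data <- function(n, na_rate = 0.2, seed = 1) {
  set.seed(seed)
  # replace a random share of the values with NA
  add_na <- function(v) {
    v[runif(length(v)) < na_rate] <- NA
    v
  }
  data.frame(Year = sample(2000:2018, n, replace = TRUE),
             Country = sample(c("Sweden", "Norway", "Denmark", "Finland"),
                              n, replace = TRUE),
             Sales = add_na(round(runif(n, 10, 20))),
             Campaigns = add_na(sample(1:5, n, replace = TRUE)),
             Employees = add_na(sample(5:20, n, replace = TRUE)),
             Satisfaction = add_na(round(runif(n, 0.5, 1), 2)),
             Expenses = add_na(round(runif(n, 4000, 10000))))
}

big <- make_data(10000)
dim(big)                                      # rows and columns
anyDuplicated(big[c("Year", "Country")]) > 0  # keys repeat, as in the question
```

Assigning `x <- make_data(10000)` before re-running the `microbenchmark()` call above should give a more representative picture than the five-row example.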