Calculate the rowwise mean over a set of columns with dplyr, allowing at most a given number of NA values
Example dataset...
library(dplyr)  # tribble() is re-exported from tibble
dat <- tribble(
  ~colA, ~colB, ~colC, ~colD, ~colE,
  1, 2, 3, 4, 5,
  2, 3, NA, 4, 5,
  3, NA, NA, NA, 4,
  4, NA, NA, 5, 6
)
dat
# A tibble: 4 × 5
   colA  colB  colC  colD  colE
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     4     5
2     2     3    NA     4     5
3     3    NA    NA    NA     4
4     4    NA    NA     5     6
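How can I create a new column that gives the mean of columns B, C, D and E, but only when a row has at most two NAs? In this case the mean for the third row should be NA, because it has 3 NAs. I included colA because I want to be able to use tidyselect to choose which variables are included.
So far I have this...
dat %>%
  rowwise() %>%
  mutate(
    mean = if_else(
      c_across(colB, colC, colD, colE),
      condition = sum(is.na(.)) <= 2,
      true = mean(., na.rm = T),
      false = NA
    )
  )
But I get this error message...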
Error in `mutate()`:
! Problem while computing `mean = if_else(...)`.
ℹ The error occurred in row 1.
Caused by error in `if_else()`:
! `false` must be a double vector, not a logical vector.
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Problem while computing `mean = if_else(...)`.
ℹ argument is not numeric or logical: returning NA
ℹ The warning occurred in row 1.
Ideally, I would have a reusable function that takes the row mean over a set of columns, given a number of allowed NAs.
We can do the following. Here is an example of how to select a set of columns with select() inside rowSums() and rowMeans().
library(dplyr)
dat2 <- dat %>%
  mutate(mean = ifelse(rowSums(is.na(select(., -colA))) > 2,
                       NA,
                       rowMeans(select(., -colA), na.rm = TRUE)))
In base R:
df <- dat  # df is the example tibble from the question
# df[-1] drops colA; the \(x) lambda syntax needs R >= 4.1
df$mean <- apply(df[-1], 1, \(x) if (sum(is.na(x)) <= 2) mean(x, na.rm = TRUE) else NA)
df
#> # A tibble: 4 x 6
#>    colA  colB  colC  colD  colE  mean
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5   3.5
#> 2     2     3    NA     4     5   4
#> 3     3    NA    NA    NA     4  NA
#> 4     4    NA    NA     5     6   5.5
Or using dplyr:
library(dplyr)
# run this on the original df, before a mean column has been added
df %>%
  mutate(mean = apply(.[-1], 1, \(x) if (sum(is.na(x)) <= 2) mean(x, na.rm = TRUE) else NA))
#> # A tibble: 4 x 6
#>    colA  colB  colC  colD  colE  mean
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5   3.5
#> 2     2     3    NA     4     5   4
#> 3     3    NA    NA    NA     4  NA
#> 4     4    NA    NA     5     6   5.5
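For comparison, the rowwise() attempt from the question can be repaired along the following lines. This is a minimal sketch rather than a definitive fix: it assumes the columns of interest are the contiguous range colB:colE, replaces the vectorised if_else() with a plain if/else (each row yields a single value), stores the c_across() result in a local variable instead of relying on `.` (which in the original pipe refers to the piped data frame, not to the row values), and uses NA_real_ so that both branches are doubles.

```r
library(dplyr)

# Example data from the question
dat <- tibble::tribble(
  ~colA, ~colB, ~colC, ~colD, ~colE,
  1, 2, 3, 4, 5,
  2, 3, NA, 4, 5,
  3, NA, NA, NA, 4,
  4, NA, NA, 5, 6
)

res <- dat %>%
  rowwise() %>%
  mutate(mean = {
    x <- c_across(colB:colE)        # values of the current row only
    if (sum(is.na(x)) <= 2) {
      mean(x, na.rm = TRUE)
    } else {
      NA_real_                      # typed NA keeps the column double
    }
  }) %>%
  ungroup()                         # drop the rowwise grouping again

res
```

The rowwise approach is usually slower than the rowSums()/rowMeans() variants on large data, since the expression is evaluated once per row.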
We can use across() to select the columns of interest.
library(dplyr)
# note: in dplyr >= 1.1.0, across() without .fns is deprecated; pick(-colA) is the suggested replacement
dat %>%
  mutate(mean = ifelse(rowSums(is.na(across(-colA))) > 2,
                       NA,
                       rowMeans(across(-colA), na.rm = TRUE)))
# A tibble: 4 × 6
   colA  colB  colC  colD  colE  mean
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     4     5   3.5
2     2     3    NA     4     5   4
3     3    NA    NA    NA     4  NA
4     4    NA    NA     5     6   5.5
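The question also asks for a reusable function that accepts a tidyselect column specification and a number of allowed NAs. One possible sketch is shown here; the function name row_mean_max_na and its arguments cols and max_na are made up for illustration and are not part of dplyr. It selects the columns once with select() and the `{{ }}` embrace operator, then reuses the rowSums()/rowMeans() idea from the answers.

```r
library(dplyr)

# Hypothetical helper: row mean over `cols`, NA when a row has more than `max_na` NAs
row_mean_max_na <- function(data, cols, max_na = 2) {
  picked <- select(data, {{ cols }})             # tidyselect happens here
  row_means <- rowMeans(picked, na.rm = TRUE)
  row_means[rowSums(is.na(picked)) > max_na] <- NA_real_
  # .env$ makes sure the local vector is used, not a column of the same name
  mutate(data, mean = .env$row_means)
}

# Example data from the question
dat <- tibble::tribble(
  ~colA, ~colB, ~colC, ~colD, ~colE,
  1, 2, 3, 4, 5,
  2, 3, NA, 4, 5,
  3, NA, NA, NA, 4,
  4, NA, NA, 5, 6
)

out_range  <- row_mean_max_na(dat, colB:colE)            # at most 2 NAs allowed
out_strict <- row_mean_max_na(dat, -colA, max_na = 1)    # at most 1 NA allowed
out_range
```

Any tidyselect expression should work for cols, for example starts_with("col") & !colA. The output column name is fixed to mean in this sketch.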
A data.table option:
library(data.table)
# note: is.na(df) counts NAs in every column, including colA (which has none here)
setDT(df)[!(rowSums(is.na(df)) > 2), mean := rowMeans(.SD, na.rm = TRUE), .SDcols = -1]
Output:
   colA colB colC colD colE mean
1:    1    2    3    4    5  3.5
2:    2    3   NA    4    5  4.0
3:    3   NA   NA   NA    4   NA
4:    4   NA   NA    5    6  5.5
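If the NA count should be restricted to the averaged columns (the one-liner above counts NAs across all columns, colA included), a variant along these lines may help. It is only a sketch: the vector cols and the object name dt are assumptions for illustration, and the data is rebuilt by hand so the block runs on its own.

```r
library(data.table)

# Example data from the question, rebuilt as a data.table
dt <- data.table(
  colA = c(1, 2, 3, 4),
  colB = c(2, 3, NA, NA),
  colC = c(3, NA, NA, NA),
  colD = c(4, 4, NA, 5),
  colE = c(5, 5, 4, 6)
)

cols <- c("colB", "colC", "colD", "colE")  # columns to average (assumed)

dt[, mean := {
  m <- rowMeans(.SD, na.rm = TRUE)         # row means over the selected columns
  m[rowSums(is.na(.SD)) > 2] <- NA_real_   # blank out rows with more than 2 NAs
  m
}, .SDcols = cols]

dt
```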