Replacing NAs in an R data frame with the group mean, applied to multiple columns
I have this data frame:
library(tidyverse)
df <- tibble(
"group" = c("A", "A", "B", "B"),
"WC" = c(NA, 2.3, 3.5, 4),
"Sixltr" = c(3.3, NA, NA, 2.7),
"Dic" = c(NA, NA, NA, 2.4),
"I" = c(3.1, 3, 2.7, 1.9),
"We" = c(4.6, NA, 2.2, NA)
)
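I created the function mean_NA_conditional_function to replace NAs with a mean (based on some conditions), and then used lapply to apply it to every column of the data frame. The conditions are not essential, though; I could just as well use the regular mean.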
mean_NA_conditional_function <- function(vector) {
# when NA <= 1 in vector, return the mean of available data in vector
if (sum(is.na(vector)) <= 1) {return(mean(vector, na.rm = TRUE))}
# when NA >= 2 in vector, return the sum of available data in vector divided by vector length - 1
if (sum(is.na(vector)) >= 2) {return((sum(vector, na.rm = TRUE)) / (length(vector) - 1))}
}
#Create the 'NAs_replace_function' function that replaces NAs applying the 'mean_NA_conditional_function'.
NAs_replace_function <- function(vector) replace(vector, is.na(vector), mean_NA_conditional_function(vector))
#Apply the function 'NAs_replace_function' to selected columns and replace NAs with appropriate mean.
df_after_imputation <- replace(df, TRUE, lapply(df, NAs_replace_function))
So far, this works. However, what I want is to replace each NA based on the group its value belongs to (i.e. 'A' or 'B').
I tried group_by(), but it didn't work. I'm not sure whether I'm doing something wrong. Any ideas on how to solve this?
# This doesn't work:
df_after_imputation <- df %>% group_by(group) %>% replace(., TRUE, lapply(df, NAs_replace_function))
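You can use group_by() together with mutate() and across(), so that the function is applied to each column within each group: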
library(dplyr)
df %>%
group_by(group) %>%
mutate(across(WC:We, NAs_replace_function)) %>%
ungroup -> df_after_imputation
df_after_imputation
# group WC Sixltr Dic I We
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2.3 3.3 0 3.1 4.6
#2 A 2.3 3.3 0 3 4.6
#3 B 3.5 2.7 2.4 2.7 2.2
#4 B 4 2.7 2.4 1.9 2.2
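Since the question mentions that the regular mean would also be acceptable, here is a minimal sketch of that variant. The helper name impute_group_mean and the result name df_group_mean are hypothetical, chosen only for illustration, and the data frame is a cut-down copy of the one in the question. It also suggests why the original attempt behaved as it did: in that call, lapply(df, ...) is evaluated on the whole, ungrouped df, so the grouping never reaches the function, whereas mutate(across(...)) evaluates the function once per group.

```r
library(dplyr)

# Reduced copy of the question's data, kept small for illustration
df <- tibble(
  group = c("A", "A", "B", "B"),
  WC    = c(NA, 2.3, 3.5, 4),
  Dic   = c(NA, NA, NA, 2.4)
)

# Hypothetical helper: replace NAs with the plain mean of the non-missing values
impute_group_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

df_group_mean <- df %>%
  group_by(group) %>%
  # where(is.numeric) selects the numeric columns; grouping columns are skipped by across()
  mutate(across(where(is.numeric), impute_group_mean)) %>%
  ungroup()

df_group_mean
```

Note that a group with no observed values at all (Dic in group A here) ends up as NaN with the plain mean, because mean(x, na.rm = TRUE) of an all-NA vector is NaN. The conditional function from the question returns 0 in that case instead.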