Using summarize across with multiple functions when there are missing values
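If I want to get the mean and sum of all the numeric columns in the mtcars data set, I would use the following code:
library(dplyr)

mtcars %>%
  group_by(gear) %>%
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))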
But if some of the columns have missing values, how do I take that into account? Here is a reproducible example:
test.df1 <- data.frame("Year" = sample(2018:2020, 20, replace = TRUE),
                       "Firm" = head(LETTERS, 5),
                       "Exporter" = sample(c("Yes", "No"), 20, replace = TRUE),
                       "Revenue" = sample(100:200, 20, replace = TRUE),
                       stringsAsFactors = FALSE)
# Add two firms whose Revenue is missing
test.df1 <- rbind(test.df1,
                  data.frame("Year" = c(2018, 2018),
                             "Firm" = c("Y", "Z"),
                             "Exporter" = c("Yes", "No"),
                             "Revenue" = c(NA, NA)))
test.df1 <- test.df1 %>% mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE))
# Attempted adaptation (fails: the function list is outside across() and the parentheses are unbalanced)
test.df_summarized <- test.df1 %>% group_by(Firm) %>% summarize(across(where(is.numeric)), list(mean = mean, sum = sum)))
If I were summarizing each variable separately, I could use the following:
test.df1 %>%
  group_by(Firm) %>%
  summarize(Revenue_mean = mean(Revenue, na.rm = TRUE),
            Profit_mean = mean(Profit, na.rm = TRUE))
But I am trying to figure out how to adapt the code I wrote above for mtcars to the example data set I have provided here.
Because your functions all have an na.rm argument, you can pass it along through the ... argument of across():
test.df1 %>% summarize(across(where(is.numeric), list(mean = mean, sum = sum), na.rm = TRUE))
# Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
# 1 2019.045 44419 162.35 3247 138.25 2765
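(I left out group_by because it was not specified correctly in your code, and the example still illustrates the point well without it. Also make sure your functions are inside across().)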
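To make the effect of na.rm concrete, here is a minimal base R sketch. The vectors x and all_missing are made up for illustration and are not part of the question's data.

```r
# Hypothetical vectors, used only to illustrate na.rm
x <- c(10, NA, 30)

mean(x)                 # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE)   # 20: the NA is dropped before averaging
sum(x, na.rm = TRUE)    # 40

# When every value is missing, nothing is left after dropping the NAs
all_missing <- c(NA_real_, NA_real_)
mean(all_missing, na.rm = TRUE)  # NaN
sum(all_missing, na.rm = TRUE)   # 0
```

The last two calls matter for the example data once it is grouped by Firm: firms Y and Z have only missing Revenue, so their mean would come out as NaN and their sum as 0.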
For the record, you could also do it like this (this works when the different functions take different arguments):
test.df1 %>%
  summarise(across(where(is.numeric),
                   list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     sum = ~ sum(.x, na.rm = TRUE))
                   )
            )
# Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
# 1 2019.045 44419 144.05 2881 119.3 2386
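Neither answer puts group_by(Firm) back in, so here is a sketch of what the grouped version might look like. It rebuilds the question's data and wraps each function so that it carries its own arguments. The set.seed() call, the by_firm name and the n_missing summary are assumptions added for illustration and are not from the thread. Wrapping the functions also avoids passing na.rm through the ... argument of across(), which is deprecated as of dplyr 1.1.0.

```r
library(dplyr)

set.seed(1)  # assumption: added only so the sketch is repeatable

# Rebuild the example data from the question
test.df1 <- data.frame(Year = sample(2018:2020, 20, replace = TRUE),
                       Firm = head(LETTERS, 5),
                       Exporter = sample(c("Yes", "No"), 20, replace = TRUE),
                       Revenue = sample(100:200, 20, replace = TRUE),
                       stringsAsFactors = FALSE)
test.df1 <- rbind(test.df1,
                  data.frame(Year = c(2018, 2018),
                             Firm = c("Y", "Z"),
                             Exporter = c("Yes", "No"),
                             Revenue = c(NA, NA)))
test.df1 <- test.df1 %>%
  mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE))

# One row per Firm; each function handles missing values itself
by_firm <- test.df1 %>%
  group_by(Firm) %>%
  summarise(across(where(is.numeric),
                   list(mean = function(x) mean(x, na.rm = TRUE),
                        sum = function(x) sum(x, na.rm = TRUE),
                        # hypothetical extra summary: count of missing values
                        n_missing = function(x) sum(is.na(x)))))
by_firm
```

The result should have one row for each of the seven firms, with columns named by the column_function pattern (for example Revenue_mean and Revenue_n_missing). For firms Y and Z, Revenue_mean is NaN and Revenue_sum is 0 because all of their Revenue values are missing.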