Summary statistics for multiple variables with statistics as rows and variables as columns?
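I am trying to use dplyr::summarize() and dplyr::across() to get a tibble with several summary statistics in the rows and the variables in the columns. So far I have only been able to achieve this with dplyr::bind_rows(), but I am wondering whether there is a more elegant way to get the same output.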
> library(tidyverse)
── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3 ✔ purrr 0.3.4
✔ tibble 3.1.1 ✔ dplyr 1.0.6
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
>
> bind_rows(min = summarize(starwars, across(where(is.numeric), min,
+ na.rm = TRUE)),
+ median = summarize(starwars, across(where(is.numeric), median,
+ na.rm = TRUE)),
+ mean = summarize(starwars, across(where(is.numeric), mean, na.rm = TRUE)),
+ max = summarize(starwars, across(where(is.numeric), max, na.rm = TRUE)),
+ sd = summarize(starwars, across(where(is.numeric), sd, na.rm = TRUE)),
+ .id = "statistic")
# A tibble: 5 x 4
statistic height mass birth_year
<chr> <dbl> <dbl> <dbl>
1 min 66 15 8
2 median 180 79 52
3 mean 174. 97.3 87.6
4 max 264 1358 896
5 sd 34.8 169. 155.
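Why can't this be done directly with summarize? That seems more elegant than using a list of functions, as the colwise vignette suggests. Does it violate the principles of a tidy data frame? (To me, binding a bunch of data frames together seems rather untidy.)
This gives the output you want, but it is not especially fancy.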
starwars %>%
  summarise(across(
    where(is.numeric),
    .fns = list(
      min = min,
      median = median,
      mean = mean,
      max = max,
      sd = sd
    ),
    na.rm = TRUE,
    .names = "{.col}_{.fn}")) %>%
  pivot_longer(cols = everything()) %>%
  mutate(statistic = str_match(name, pattern = ".+_(.+)")[,2],
         name = str_match(name, pattern = "(.+)_.+")[,2]) %>%
  pivot_wider(names_from = name, values_from = value)
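You can use gtsummary to summarize the data. Below I subset to the numeric columns (although gtsummary handles many different data types). I then use the type argument to put the summary statistics on separate rows, and finally the statistic argument to specify which summaries to display.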
library(tidyverse)
library(gtsummary)
starwars[sapply(starwars, is.numeric)] %>%
  tbl_summary(type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c("{median} ({p25}, {p75})",
                                               "{min}, {max}",
                                               "{mean},{sd}"))
I would do it like this:
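# stat_funs is not defined in the snippet as posted; a named list like this
# is assumed here, matching the statistics shown in the output below.
stat_funs <- list(min = min, median = median, mean = mean, max = max, sd = sd)

starwars %>%
  summarise(across(where(is.numeric), stat_funs,
                   na.rm = TRUE, .names = "{.col}__{.fn}")) %>%
  pivot_longer(everything()) %>%
  separate(name, c('v', 'f'), sep = '__') %>%
  pivot_wider(names_from = v, values_from = value)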
# f height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
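As a possible shortcut, the pivot_longer() / separate() / pivot_wider() steps above can probably be folded into a single pivot_longer() call by using the ".value" sentinel in names_to together with names_sep. The sketch below is only an illustration under some assumptions: the name stat_lambdas is made up for this example, and the functions are written as lambdas so that na.rm = TRUE does not have to be passed through across().

```r
library(dplyr)
library(tidyr)

# Hypothetical helper list (the name is not from the thread).
# Each lambda sets na.rm = TRUE itself.
stat_lambdas <- list(
  min    = ~ min(.x, na.rm = TRUE),
  median = ~ median(.x, na.rm = TRUE),
  mean   = ~ mean(.x, na.rm = TRUE),
  max    = ~ max(.x, na.rm = TRUE),
  sd     = ~ sd(.x, na.rm = TRUE)
)

stats_tbl <- starwars %>%
  # One wide row with columns such as height__min, height__median, ...
  summarise(across(where(is.numeric), stat_lambdas,
                   .names = "{.col}__{.fn}")) %>%
  # Split each name at "__": the first part becomes the output column
  # (".value"), the second part goes into the "statistic" column.
  pivot_longer(everything(),
               names_to = c(".value", "statistic"),
               names_sep = "__")

stats_tbl
```

The double underscore is used as the separator because birth_year already contains a single underscore, so splitting on "_" would be ambiguous.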
Here is a way to iterate over a list of functions using purrr. This is effectively what you were doing with bind_rows(), but with less code.
library(dplyr)
library(purrr)
funs <- lst(min, median, mean, max, sd)
map_dfr(funs,
        ~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
        .id = "statistic")
# # A tibble: 5 x 4
# statistic height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
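For comparison, the same "iterate over the statistics" idea can also be sketched in base R, where sapply() over the numeric columns returns a matrix with statistics as rows and variables as columns. This is only a rough alternative, not something proposed in the thread; the names num_cols and stat_mat are made up for the example, and dplyr is loaded solely for the starwars data.

```r
library(dplyr)  # only needed here for the starwars dataset

# Keep the numeric columns, as in the gtsummary answer.
num_cols <- starwars[sapply(starwars, is.numeric)]

# Each column yields a named vector of five statistics;
# sapply() binds these vectors together as the columns of a matrix.
stat_mat <- sapply(num_cols, function(x) {
  c(min    = min(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
})

stat_mat

# Optional: convert to a tibble, moving the row names into a column.
tibble::as_tibble(stat_mat, rownames = "statistic")
```

The result is a plain numeric matrix, so the integer height column is coerced to double along with the rest.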