Adding higher level groupings in summary statistic tables
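A while ago I asked how to make a grouped summary table.
I'd like to do something similar, with a few more steps, but I'm not sure how to proceed.
This is what I have so far: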
data %>%
  dplyr::filter_all(all_vars(!is.na(.))) %>%
  group_by(Type.Time, Type.Perc, Grp) %>%
  dplyr::summarise(mean.ms = sprintf("%.2f", mean(Time, na.rm = TRUE)),
                   se.ms = sprintf("%.2f", (sd(Time, na.rm = T))/sqrt(data %>% filter(Grp == 1) %>% nrow())),
                   mean.perc = sprintf("%.2f", mean(Percentage, na.rm = TRUE)),
                   se.perc = sprintf("%.2f", (sd(Percentage, na.rm = T))/sqrt(data %>% filter(Grp == 1) %>% nrow())),
  ) %>%
  gather(key, value, mean.ms:se.perc) %>%
  unite(Group, Grp, key) %>%
  spread(Group, value)
This gives me the information I want, but in the wrong format and with every value appearing twice:
| Type.Time | Type.Perc | 1_mean.ms | 1_mean.perc | 1_se.ms | 1_se.perc | 2_mean.ms | 2_mean.perc | 2_se.ms | 2_se.perc|
|-----------|-----------|-----------|-------------|---------|-----------|-----------|-------------|---------|----------|
| TType2 | PType2 | 703 | 15 | 15 | 1.4 | 573 | 8 | 22 | 1.3 |
| TType2 | PType1 | 703 | 10 | 15 | 1.8 | 573 | 13 | 22 | 3.1 |
| TType1 | PType2 | 710 | 15 | 18 | 1.4 | 622 | 8 | 29 | 1.3 |
| TType1 | PType1 | 710 | 10 | 18 | 1.8 | 622 | 13 | 29 | 3.1 |
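I'd like the top-level grouping in my new table to be the 1 or 2 that precedes 'mean'/'se' (i.e. Grp [Group]). Below that should be subgroups for Type1 and Type2, with the leading T and P split into rows (ms and % respectively). So my goal is to produce a table in this format: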
| Group1 | Group2 |
|------------------------|---------------------------|
| Type1 | Type2 | Type1 | Type2 |
|------------|-----------|------------|--------------|
| M | SE | M | SE | M | SE | M | SE |
|----|-----|------|-----|-----|------|-----|------|-------|
|ms | [values calculated from 'Time' variable] |
|% | [values calculated from 'Percentage' variable] |
I hope this makes sense!
Sample data:
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), Grp = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Type.Time = c("TType1",
"TType1", "TType2", "TType2", "TType1", "TType1", "TType2", "TType2",
"TType1", "TType1", "TType2", "TType2", "TType1", "TType1",
"TType2", "TType2"), Time = c(711, 711, 669, 669, 765, 765, 876, 876, 740,
740, 658, 658, 456, 456, 423, 423), Type.Perc = c("PType1",
"PType2", "PType1", "PType2", "PType1", "PType2",
"PType1", "PType2", "PType1", "PType2", "PType1",
"PType2", "PType1", "PType2", "PType1", "PType2"
), Percentage = c(8, 3, 9, 7, 19, 22, 30, 21, 10, 5, 10, 5, 8, 7,
13, 5)), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))
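One option for configuring this kind of header grouping is the kableExtra package.
For the data preparation I made two main changes. First, I only consider rows where Type.Time matches Type.Perc (to avoid the excess combinations shown in the question). Second, I calculate the SE values per type and group (the sample code mixes different groupings, which I assume was not intentional).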
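To make the second change concrete, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the question's data). It contrasts dividing by a fixed row count, as the question's code does, with dividing by the size of the current group via `n()`:

```r
library(dplyr)

# Hypothetical toy data: two groups of different sizes
toy <- data.frame(
  Grp   = c(1, 1, 1, 1, 2, 2),
  value = c(10, 12, 14, 16, 20, 30)
)

se_compare <- toy %>%
  group_by(Grp) %>%
  summarise(
    # fixed denominator: always the number of rows in Grp 1
    se_fixed = sd(value) / sqrt(nrow(filter(toy, Grp == 1))),
    # per-group denominator: the size of the current group
    se_group = sd(value) / sqrt(n()),
    .groups = "drop"
  )

se_compare
```

The two columns agree for Grp 1 but differ for Grp 2, because the fixed denominator no longer matches the number of observations that went into `sd()`.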
library(tidyverse)

df <- data %>%
  # keep only rows without missing values
  tidyr::drop_na() %>%
  # reduce "TType1"/"PType1" to a shared "Type1" label
  dplyr::mutate(
    Type = stringr::str_extract(Type.Time, "Type[0-9]"),
    Type.Perc = stringr::str_extract(Type.Perc, "Type[0-9]")
  ) %>%
  # keep only the matching Time/Percentage type combinations
  dplyr::filter(Type == Type.Perc) %>%
  dplyr::select(-Type.Perc, -Type.Time, -ID) %>%
  # one row per measurement; parameter is "Percentage" or "Time"
  pivot_longer(c(Percentage, Time), names_to = "parameter") %>%
  group_by(Type, Grp, parameter) %>%
  dplyr::summarise(
    mean = sprintf("%.2f", mean(value, na.rm = TRUE)),
    # SE uses the size of the current group, n()
    se = sprintf("%.2f", sd(value, na.rm = TRUE) / sqrt(n())),
    .groups = "drop"
  ) %>%
  tidyr::pivot_longer(c(mean, se)) %>%
  # sort so the wide columns come out ordered by Grp, then Type
  arrange(Grp, Type) %>%
  tidyr::pivot_wider(id_cols = "parameter", names_from = c("Grp", "Type", "name"))
# A tibble: 2 x 9
parameter `1_Type1_mean` `1_Type1_se` `1_Type2_mean` `1_Type2_se` `2_Type1_mean`
<chr> <chr> <chr> <chr> <chr> <chr>
1 Percentage 13.50 5.50 14.00 7.00 9.00
2 Time 738.00 27.00 772.50 103.50 598.00
# ... with 3 more variables: `2_Type1_se` <chr>, `2_Type2_mean` <chr>,
# `2_Type2_se` <chr>
The values are already in the right layout, so we can simply define the header groups with add_header_above. kableExtra also provides plenty of additional options for modifying the output format.
library(kableExtra)

knitr::kable(df,
             col.names = c("", "M", "SE", "M", "SE", "M", "SE", "M", "SE"),
             # one alignment entry per column: 1 label column + 8 value columns
             align = c("l", rep("r", 8)),
             format = "html") %>%
  kable_styling() %>%
  # header rows stack upwards: the Type row first, then the Group row above it
  add_header_above(c(" ", "Type1" = 2, "Type2" = 2, "Type1" = 2, "Type2" = 2)) %>%
  add_header_above(c(" ", "Group1" = 4, "Group2" = 4))
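If the number of groups or types may change, the header spans could be derived from the column names instead of being typed by hand. The following is only a sketch: `header_from` is a hypothetical helper, and `cols` is assumed to hold the `Grp_Type_name` column names produced by `pivot_wider()` above, without the leading `parameter` column.

```r
# Assumed value columns, in the order produced by the pipeline above
cols <- c("1_Type1_mean", "1_Type1_se", "1_Type2_mean", "1_Type2_se",
          "2_Type1_mean", "2_Type1_se", "2_Type2_mean", "2_Type2_se")

# Split each name into its three parts: Grp, Type, statistic
parts <- do.call(rbind, strsplit(cols, "_", fixed = TRUE))

# Hypothetical helper: consecutive runs of `key` become header cells,
# each labelled with the matching entry of `label`
header_from <- function(key, label = key) {
  runs <- rle(key)
  ends <- cumsum(runs$lengths)
  # leading " " = 1 leaves the row-label column without a header
  c(" " = 1, setNames(runs$lengths, label[ends]))
}

group_header <- header_from(paste0("Group", parts[, 1]))
# key on Grp + Type so equal Type labels in adjacent groups are not merged
type_header  <- header_from(paste(parts[, 1], parts[, 2]), label = parts[, 2])

group_header
type_header
```

The resulting named vectors can be passed in place of the literal ones, e.g. `add_header_above(type_header) %>% add_header_above(group_header)`.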