Keeping factor order after gather and summarise steps in tidyverse
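I am trying to compute frequencies and percentages for more than a hundred variables. How can I keep the factor order of each variable's values in the output? Note that specifying the order for each variable outside the dataset is impractical, because I have over 100 variables.
Example data: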
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df
gender disease
1 male yes
2 female yes
3 male no
4 <NA> <NA>
Attempt:
df %>% gather(key, value, factor_key = T) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n))
Output:
# A tibble: 6 x 4
# Groups: key [2]
key value n percent
<fct> <chr> <int> <dbl>
1 gender female 1 0.25
2 gender male 2 0.5
3 gender NA 1 0.25
4 disease no 1 0.25
5 disease yes 2 0.5
6 disease NA 1 0.25
The desired output would order gender as male, female and disease as yes, no.
Update: if you use pivot_longer (the successor to gather), it keeps the factor levels! You can also fine-tune the column types with the names_transform and values_transform arguments of pivot_longer.
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
pivot_longer(everything()) %>%
group_by(name, value) %>%
summarise(n=n(), .groups = "drop_last") %>%
mutate(percent=n/sum(n))
#> # A tibble: 6 x 4
#> # Groups: name [2]
#> name value n percent
#> <chr> <fct> <int> <dbl>
#> 1 disease yes 2 0.5
#> 2 disease no 1 0.25
#> 3 disease <NA> 1 0.25
#> 4 gender male 2 0.5
#> 5 gender female 1 0.25
#> 6 gender <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)
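In the output above the name column is a character vector, so the variables come out alphabetically (disease before gender). As a minimal sketch of the names_transform argument mentioned in the update, and assuming tidyr 1.1.0 or later, the name column can be turned into a factor whose levels follow the original column order. The object name res and the use of count() in place of group_by() plus summarise() are choices made for this illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
  disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
)

res <- df %>%
  pivot_longer(
    everything(),
    # make `name` a factor ordered like the columns of df (gender, then disease)
    names_transform = list(name = function(x) factor(x, levels = names(df)))
  ) %>%
  # count() is shorthand for group_by(name, value) followed by summarise(n = n())
  count(name, value) %>%
  group_by(name) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

res
```

With both name and value stored as factors, the rows should follow the column order of df and, within each variable, the level order of that variable, with NA last.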
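Because gather drops the factor class of the value column, and summarise also seems to drop attributes, you have to re-add the levels. You can semi-automate this by reading the factor levels from the original data frame and combining them: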
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
gather(key, value, factor_key = TRUE) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n),
# rebuild the factor from the levels stored in df; unique() guards against
# levels shared by several variables, which factor() would reject as duplicates
value = factor(value, levels = df %>% map(levels) %>% unlist() %>% unique())) %>%
arrange(key, value)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> `summarise()` regrouping output by 'key' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: key [2]
#> key value n percent
#> <fct> <fct> <int> <dbl>
#> 1 gender male 2 0.5
#> 2 gender female 1 0.25
#> 3 gender <NA> 1 0.25
#> 4 disease yes 2 0.5
#> 5 disease no 1 0.25
#> 6 disease <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)