Preserve the order of input variables and factor levels in a summary table using dplyr and tidyr
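I love how easily dplyr and tidyr let me create a single summary table with multiple predictor and outcome variables. The one thing that trips me up is the final step: preserving or defining the order of the predictor variables and their factor levels in the output table.
I came up with a solution (below) that uses mutate to manually create a factor variable combining the predictor name and the predictor value (e.g. "gender_Female"), with its levels in the desired output order. But my solution gets long-winded when there are many variables. Is there a better way?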
library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  gender = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1 = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))
Source: local data frame [9 x 4]
  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped
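The table above is what I want: neither Predictor nor Pred_value is sorted alphabetically.
Edit
As requested, here is what the default (alphabetical) ordering produces. This makes sense: when the factors are gathered into one column, they are converted to a character variable and all attributes, including the level order, are dropped.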
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)
Source: local data frame [9 x 4]
  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped
You can do this more concisely and efficiently without any special packages:
rbind(aggregate(dat[, colnames(dat) %in% c("outcome1", "outcome2")],
                by = list(dat$gender), sum),
      aggregate(dat[, colnames(dat) %in% c("outcome1", "outcome2")],
                by = list(dat$ethnicity), sum))
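It summarises multiple predictor and outcome variables in a simple, direct way, and it avoids having to create the extra ordering variable that is part of the more complicated solution you describe.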
   Group.1 outcome1 outcome2
1   Female       25       27
2     Male       11       10
3  Unknown       12       15
4    Maori       10        9
5  Pacific        7        7
6    Asian        6       12
7    Other       10        9
8 European        5        4
9  Unknown       10       11
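If you want to rename the columns above, just assign the result to an object (e.g. mytable <- ) and rename them (i.e. colnames(mytable) <- c("Pred_value", "outcome1", "outcome2")). If there are too many variables to type out, you can also use apply to scale this up.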
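One way to scale this to many predictors is sketched below. It is an illustration under assumptions rather than part of the answer above: it uses lapply instead of apply, and the names predictors, outcomes and summarise_one are hypothetical. It also adds a Predictor column so that the two "Unknown" rows can be told apart.

```r
# Rebuild the example data from the question (base R only).
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

# Hypothetical names, not from the original answer.
predictors <- c("gender", "ethnicity")  # listed in the desired output order
outcomes   <- c("outcome1", "outcome2")

# Sum the outcomes for one predictor. aggregate() returns its rows in
# factor-level order, so the level order of each predictor is kept.
summarise_one <- function(p) {
  res <- aggregate(dat[outcomes], by = list(Pred_value = dat[[p]]), FUN = sum)
  cbind(Predictor = p, res)
}

# Stack the per-predictor tables in the order given by `predictors`.
mytable <- do.call(rbind, lapply(predictors, summarise_one))
print(mytable)
```

Adding another predictor then only means adding its column name to predictors; the row order follows that vector and each factor's own levels.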
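If you want your data arranged in factor order like this, you need to convert the columns back to factors, because gather coerces them to character (and warns you about it). You can use gather's factor_key argument to take care of predictor, but for pred_value you have to assemble the levels yourself, because it now combines the two original factors. Simplifying a bit: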
library(tidyr)
library(dplyr)
dat %>%
  gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
  group_by(predictor, pred_value) %>%
  summarise_all(sum) %>%
  ungroup() %>%
  mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd),
                                                         fromLast = TRUE))) %>%
  arrange(predictor, pred_value)
## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11
Note that you need unique with fromLast = TRUE so that the duplicated "Unknown" value ends up in the right position; union would place it too early.
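A small base R check makes the difference concrete. The variable names combined, lv_last and lv_union are hypothetical and only used for this illustration.

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
combined <- c(levels_eth, levels_gnd)

# fromLast = TRUE keeps the LAST occurrence of each duplicated value,
# so "Unknown" ends up after both "European" and "Male".
lv_last <- unique(combined, fromLast = TRUE)
print(lv_last)

# union() keeps the FIRST occurrence, so "Unknown" lands before "Female".
lv_union <- union(levels_eth, levels_gnd)
print(lv_union)

# Sorting the gender values under each level set shows the effect:
# with lv_last "Unknown" sorts last, with lv_union it sorts first.
print(sort(factor(levels_gnd, levels = lv_last)))
print(sort(factor(levels_gnd, levels = lv_union)))
```

This works here because "Unknown" is the last level of both factors. If a shared value sat at different positions in the two level sets, a single combined factor could not honour both orders.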
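You can prefix the variable names with values that force them into the right order, e.g. "X1_gender", "X2_ethnicity". The prefix can be stripped at the end with mutate. This orders only the predictors; as the output below shows, the values within each predictor are still sorted alphabetically. It may not be a "tidy" solution, but it worked for my purposes and solved the problem that led me to this post.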
library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  X1_gender = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  X2_ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1 = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, X1_gender, X2_ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n) %>%
  mutate(predictor = gsub("^X[0-9]_", "", predictor))
Result:
`summarise()` regrouping output by 'predictor', 'pred_value' (override with `.groups` argument)
# A tibble: 9 x 4
# Groups:   predictor, pred_value [9]
  predictor pred_value outcome1 outcome2
  <chr>     <chr>         <int>    <int>
1 gender    Female           16       21
2 gender    Male             12       15
3 gender    Unknown          18       16
4 ethnicity Asian             4        6
5 ethnicity European         13       13
6 ethnicity Maori             4        6
7 ethnicity Other             7       11
8 ethnicity Pacific          10        9
9 ethnicity Unknown           8        7
Warning message:
attributes are not identical across measure variables;
they will be dropped