Mapping non-numeric factors to choose the higher value between two columns in R
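I have a data frame with two columns, PathGroupStage and ClinGroupStage. I want to create a new column, OutputStage, that takes the higher of the two stages.
Valid stage values: I, IA, IB, II, IIA, IIB, III, IIIA, IIIB, IIIC, IV, IVA, IVB, IVC, Unknown.
- If both stages have a value, use the higher one, e.g. IIIB > IIIA > III
- If one is missing and the other has a value, use the one that has a value
- If both are missing or unknown, the result is Unknown
How can I derive the OutputStage variable by comparing the non-numeric values of the two columns? I think I need to use factor levels, but how do I compare factors across different columns?
Here is a sample dataset: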
   PathGroupStage       ClinGroupStage
1              II                 <NA>
2               I                   IA
3             IVB                  IVB
4            IIIA Unknown/Not Reported
5               I                  III
6              II                 <NA>
7            IIIA                  IIB
8              II                   II
9            <NA>                 <NA>
10           IIIB Unknown/Not Reported
df <- structure(list(PathGroupStage = c("II", "I", "IVB", "IIIA", "I",
"II", "IIIA", "II", NA, "IIIB"), ClinGroupStage = c(NA, "IA",
"IVB", "Unknown/Not Reported", "III", NA, "IIB", "II", NA, "Unknown/Not Reported"
)), row.names = c(NA, 10L), class = "data.frame")
One option could be:
library(dplyr)

# "Unknown/Not Reported" is the lowest level, so any real stage outranks it
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB", "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")
df %>%
  mutate(across(everything(), ~ factor(., levels = stages, ordered = TRUE)),
         OutputStage = pmax(PathGroupStage, ClinGroupStage, na.rm = TRUE))
   PathGroupStage       ClinGroupStage OutputStage
1              II                 <NA>          II
2               I                   IA          IA
3             IVB                  IVB         IVB
4            IIIA Unknown/Not Reported        IIIA
5               I                  III         III
6              II                 <NA>          II
7            IIIA                  IIB        IIIA
8              II                   II          II
9            <NA>                 <NA>        <NA>
10           IIIB Unknown/Not Reported        IIIB
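Note that row 9 in the output above stays <NA>, whereas the question asks for Unknown when both values are missing. A minimal base R sketch of one way to cover that rule follows. It assumes that a missing value can be treated the same as "Unknown/Not Reported", and the helper names path_rank, clin_rank and best_rank are made up for illustration.

```r
# "Unknown/Not Reported" sits at position 1, so every real stage outranks it.
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
            "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

df <- data.frame(
  PathGroupStage = c("II", "I", "IVB", "IIIA", "I", "II", "IIIA", "II", NA, "IIIB"),
  ClinGroupStage = c(NA, "IA", "IVB", "Unknown/Not Reported", "III", NA, "IIB",
                     "II", NA, "Unknown/Not Reported")
)

# match() returns each value's position in `stages`; NA (or an unlisted value) gives NA.
path_rank <- match(df$PathGroupStage, stages)
clin_rank <- match(df$ClinGroupStage, stages)

# Assumption: a missing stage is treated as rank 1, i.e. "Unknown/Not Reported".
path_rank[is.na(path_rank)] <- 1L
clin_rank[is.na(clin_rank)] <- 1L

# Row-wise maximum of the two ranks, mapped back to an ordered factor.
best_rank <- pmax(path_rank, clin_rank)
df$OutputStage <- factor(stages[best_rank], levels = stages, ordered = TRUE)

df
```

One caveat with this sketch: a misspelled stage that is not in stages is silently turned into "Unknown/Not Reported", so it may be worth checking the inputs against stages first.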
df <- structure(
list(
PathGroupStage = c("II", "I", "IVB", "IIIA", "I", "II", "IIIA", "II", NA, "IIIB"),
ClinGroupStage = c(NA, "IA", "IVB", "Unknown/Not Reported", "III", NA, "IIB", "II", NA, "Unknown/Not Reported")
),
row.names = c(NA, 10L), class = "data.frame"
)
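library(dplyr)

# The columns are still character vectors, not factors, as the tibble
# print method shows
df %>% as_tibble()
# "Unknown/Not Reported" goes first so that it ranks below every real stage;
# placed last it would outrank them and win the >= comparisons below
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB", "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")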
df %>%
  as_tibble() %>%
  dplyr::mutate(
    # once the columns are ordered factors with the same levels, they can be
    # compared with <, >=, == and so on
    PathGroupStage = factor(PathGroupStage, levels = stages, ordered = TRUE),
    ClinGroupStage = factor(ClinGroupStage, levels = stages, ordered = TRUE),
    # case_when() is a more general if_else() with multiple conditions
    # of the form: logical test ~ result if true; the first matching test wins
    OutputStage = case_when(
      (is.na(ClinGroupStage) | ClinGroupStage == "Unknown/Not Reported") &
        (is.na(PathGroupStage) | PathGroupStage == "Unknown/Not Reported") ~
        factor("Unknown/Not Reported", levels = stages, ordered = TRUE),
      is.na(PathGroupStage) ~ ClinGroupStage,
      is.na(ClinGroupStage) ~ PathGroupStage,
      PathGroupStage >= ClinGroupStage ~ PathGroupStage,
      ClinGroupStage >= PathGroupStage ~ ClinGroupStage
    )
  )
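On the question of how to compare factors across columns: as far as I know, two ordered factors can be compared element-wise as long as they share the same levels in the same order. The small sketch below illustrates that. The vectors a and b are made-up examples, and putting "Unknown/Not Reported" first in stages is an assumption about how unknown values should rank.

```r
# Shared level order; position in this vector defines the ranking.
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
            "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

# Two hypothetical columns, converted with identical levels.
a <- factor(c("IIIA", "I"), levels = stages, ordered = TRUE)
b <- factor(c("IIB", "III"), levels = stages, ordered = TRUE)

# Element-wise comparison uses the level order, not alphabetical order.
a_is_higher <- a > b

# The underlying integer codes are the positions in `stages`.
a_codes <- as.integer(a)
```

If the two factors were built with different level sets, the comparison stops with an error, which is why both columns should be converted with the same stages vector.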