Labelling of ordered factor variable

Question

I am trying to produce a univariate regression table using the gtsummary package.

structure(list(id = 1:10, age = structure(c(3L, 3L, 2L, 3L, 2L, 
2L, 2L, 1L, 1L, 1L), .Label = c("c", "b", "a"), class = c("ordered", 
"factor")), sex = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 2L), .Label = c("F", "M"), class = "factor"), country = structure(c(1L, 
1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L), .Label = c("eng", "scot", 
"wale"), class = "factor"), edu = structure(c(1L, 1L, 1L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("x", "y", "z"), class = "factor"), 
lungfunction = c(45L, 23L, 25L, 45L, 70L, 69L, 90L, 50L, 
62L, 45L), ivdays = c(15L, 26L, 36L, 34L, 2L, 4L, 5L, 8L, 
9L, 15L), no2 = c(40L, 70L, 50L, 60L, 30L, 25L, 80L, 89L, 
10L, 40L), pm25 = c(15L, 20L, 36L, 48L, 25L, 36L, 28L, 15L, 
25L, 15L)), row.names = c(NA, 10L), class = "data.frame")

...
library(gtsummary)
publication_dummytable1_sum %>% 
  select(sex, age, lungfunction, ivdays) %>% 
  tbl_uvregression(
    method = lm,
    y = lungfunction,
    pvalue_fun = ~style_pvalue(.x, digits = 3)
  ) %>% 
  add_global_p() %>%  # add global p-value 
  bold_p() %>%        # bold p-values under a given threshold
  bold_labels()
...

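When I run this code, I get the output below. The problem is the labelling of the ordered factor variable (age): R chooses its own labels for it. Is it possible to tell R not to choose its own labels for ordered factor variables?

I would like output like the following: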

Answer 1

The easiest way to remove the odd labelling for ordered variables is to remove the ordered class from those factor variables. Example below!

library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.4.2'

publication_dummytable1_sum <- 
  structure(list(
    id = 1:10,
    age = structure(c(3L, 3L, 2L, 3L, 2L, 2L, 2L, 1L, 1L, 1L),
                    .Label = c("c", "b", "a"), class = c("ordered", "factor")),
    sex = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L),
                    .Label = c("F", "M"), class = "factor"),
    country = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L),
                        .Label = c("eng", "scot", "wale"), class = "factor"),
    edu = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
                    .Label = c("x", "y", "z"), class = "factor"),
    lungfunction = c(45L, 23L, 25L, 45L, 70L, 69L, 90L, 50L, 62L, 45L),
    ivdays = c(15L, 26L, 36L, 34L, 2L, 4L, 5L, 8L, 9L, 15L),
    no2 = c(40L, 70L, 50L, 60L, 30L, 25L, 80L, 89L, 10L, 40L),
    pm25 = c(15L, 20L, 36L, 48L, 25L, 36L, 28L, 15L, 25L, 15L)
  ), row.names = c(NA, 10L), class = "data.frame") |>
  as_tibble()

# R labels the coefficients of ordered factors like this in lm()
lm(lungfunction ~ age, publication_dummytable1_sum)
#> 
#> Call:
#> lm(formula = lungfunction ~ age, data = publication_dummytable1_sum)
#> 
#> Coefficients:
#> (Intercept)        age.L        age.Q  
#>       51.17       -10.37       -15.11


tbl <-
  publication_dummytable1_sum %>% 
  # remove ordered class
  mutate(across(where(is.ordered), ~factor(., ordered = FALSE))) %>%
  select(sex,age,lungfunction,ivdays) %>% 
  tbl_uvregression(
    method =lm,
    y = lungfunction,
    pvalue_fun = ~style_pvalue(.x, digits = 3)
  )

Created on 2021-07-22 by the reprex package (v2.0.0)

Answer 2

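Like many other people, I think you may be misunderstanding the meaning of an "ordered" factor in R. All factors in R are ordered in some sense; estimates etc. are typically printed, plotted, etc. in the order of the levels vector. Specifying that a factor is of type ordered has two major effects:

it allows you to evaluate inequalities on the levels of the factor (e.g. you can filter(age > "b"))

the contrasts are set by default to orthogonal polynomial contrasts, which is where the L (linear) and Q (quadratic) labels come from: see e.g. this CrossValidated answer for more details.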
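Both effects can be checked in base R. The following is a minimal sketch, assuming the default session options; the vector age below is rebuilt by hand from the age column in the question rather than taken from the data frame.

```r
# Ordered factor with the same values and level order as `age` in the question
# (levels run c < b < a)
age <- factor(c("a", "a", "b", "a", "b", "b", "b", "c", "c", "c"),
              levels = c("c", "b", "a"), ordered = TRUE)

# Effect 1: inequalities are defined on the levels of an ordered factor
age > "b"
sum(age > "b")            # counts the observations in the top level "a"

# Effect 2: ordered factors get polynomial contrasts by default
getOption("contrasts")    # named vector: one default for unordered, one for ordered
colnames(contrasts(age))  # ".L" ".Q"

# For comparison, the same inequality on an unordered factor is not meaningful:
# R returns NA (with a warning, suppressed here)
suppressWarnings(factor(c("a", "b")) > "b")
```

The ".L" and ".Q" column names are what lm() appends to the variable name, which gives age.L and age.Q in the regression output.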

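If you want this variable to be treated the same way as a regular factor (so that the estimates are differences of each group from the baseline level, i.e. treatment contrasts), you can:

convert it back to an unordered factor (e.g. factor(age, ordered=FALSE))

specify that you want to use treatment contrasts in the model (in base R you would specify contrasts = list(age = "contr.treatment"))

set options(contrasts = c(unordered = "contr.treatment", ordered = "contr.treatment")) (the default for ordered is "contr.poly")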
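The three options above can be compared directly with lm() in base R. This is a sketch, assuming a reduced data frame d (a name made up here) that holds only the age and lungfunction values from the question; it does not go through gtsummary.

```r
# Two columns from the question's data, rebuilt by hand
d <- data.frame(
  age = factor(c("a", "a", "b", "a", "b", "b", "b", "c", "c", "c"),
               levels = c("c", "b", "a"), ordered = TRUE),
  lungfunction = c(45, 23, 25, 45, 70, 69, 90, 50, 62, 45)
)

# Default behaviour: polynomial contrasts, hence the .L / .Q labels
fit0 <- lm(lungfunction ~ age, data = d)
names(coef(fit0))   # "(Intercept)" "age.L" "age.Q"

# Option 1: convert back to an unordered factor (level order is kept)
d_unordered <- transform(d, age = factor(age, ordered = FALSE))
fit1 <- lm(lungfunction ~ age, data = d_unordered)

# Option 2: keep the ordered factor, ask for treatment contrasts in this model only
fit2 <- lm(lungfunction ~ age, data = d,
           contrasts = list(age = "contr.treatment"))

# Option 3: change the session-wide default, then restore the previous setting
old <- options(contrasts = c(unordered = "contr.treatment",
                             ordered = "contr.treatment"))
fit3 <- lm(lungfunction ~ age, data = d)
options(old)

names(coef(fit1))   # "(Intercept)" "ageb" "agea"
all.equal(coef(fit1), coef(fit2))
all.equal(coef(fit1), coef(fit3))
```

All three routes should give the same estimates, each one a difference from the baseline level "c". Option 3 affects every model fitted afterwards in the session, so restoring the old value as above is the cautious choice.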

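If you have an unordered ("regular") factor and the levels are not in the order you want, you can reset the level order by specifying the levels explicitly, e.g.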

mutate(across(age, ~ factor(.x, levels = c("0-10 years", "11-20 years", "21-30 years", "30-40 years"))))

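R sets the factor levels in alphabetical order by default, which is sometimes not what you want (but I can't think of a case where the order would be 'random' ...)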
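A short base R sketch of the default ordering and of resetting it. The values in sizes are hypothetical and chosen only because their alphabetical order differs from their natural order.

```r
# Hypothetical values, not from the question
sizes <- c("low", "medium", "high", "low")

# Default: levels are the sorted unique values
levels(factor(sizes))   # "high" "low" "medium"

# Explicit levels set the order, and with it the baseline (first) level
f <- factor(sizes, levels = c("low", "medium", "high"))
levels(f)               # "low" "medium" "high"

# The factor stays unordered, so models still use treatment contrasts
is.ordered(f)           # FALSE
```

Because the first level is the reference group under treatment contrasts, reordering the levels also changes which group the regression estimates are compared against.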

