Drop levels to treat 2 of them as a control case: problems with regression/modelling/statistics since it's not a dummy variable?

I've run into a question about using droplevels() on my dataset. My "disease" column is a factor with five levels (Control plus four etiologies):

BD$Etiología <- factor(BD$Etiología, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquémica"), ordered=FALSE)

Then I made a subset so that I only compare the control cases against one of the diseases:

BD_C_ID <- subset(BD, Etiología=="Control" | Etiología=="Idiop")

BD_C_ID$Etiología= droplevels(BD_C_ID$Etiología) 

BD_C_ID$Etiología

[1] Control Control Control Control Control Control Control Idiop   Idiop   Control Control Control
[13] Control Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop  
[25] Idiop   Idiop   Control Control Control Control Idiop   Control Control Control Control Control
[37] Idiop   Idiop   Idiop   Idiop  
Levels: Control Idiop

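Since the original factor is unordered, I simply dropped the levels I'm not using. Can I treat the remaining two as 0-1 coded values and use them in lm or a logistic regression? Or will that cause problems?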

Also, does this still work if I compare Control vs. BAG3 (0 and 3 in the original coding)? Or do I need to relevel them so the factor is recoded as 0-1?

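The short answer is that it doesn't matter. If you use the factor in a linear model (lm) or a logistic regression, the model takes the first level as the reference level, so in this case the reference is always "Control", whichever disease you compare it with. droplevels() is fine if you need the factor for some other function, but if it's purely for lm() or glm(), those functions drop the unused levels and build the 0/1 dummy coding for you under the hood.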

To illustrate with simulated data shaped like your example:

set.seed(111)
BD = data.frame(
          Etiologia = sample(0:4,100,replace=TRUE),
          x = rnorm(100),
          y = rnorm(100)
                )

We can fit the model directly on the subset, without calling droplevels():

BD$E <- factor(BD$Etiologia,levels=0:4,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"))

lm(y ~ x + E,data=subset(BD,E %in% c("Control","Idiop")))

Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", "Idiop")))

Coefficients:
(Intercept)            x       EIdiop  
   -0.05524      0.21596      0.30433 

And for the other comparison:

lm(y ~ x + E,data=subset(BD,E %in% c("Control","BAG3")))

Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", 
    "BAG3")))

Coefficients:
(Intercept)            x        EBAG3  
   -0.03355      0.08978     -0.21708  
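If you want to convince yourself that Control vs. BAG3 really ends up coded as 0/1 (and not 0/3), here is a minimal sketch. It rebuilds the same simulated data frame as above; the names sub_raw, sub_dropped, fit_raw, fit_dropped and dummy are just illustrative, not anything from your data.

```r
# Simulated data, same construction as in the example above
set.seed(111)
BD <- data.frame(
  Etiologia = sample(0:4, 100, replace = TRUE),
  x = rnorm(100),
  y = rnorm(100)
)
BD$E <- factor(BD$Etiologia, levels = 0:4,
               labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))

# Subset with the unused levels kept, and the same subset with them dropped
sub_raw     <- subset(BD, E %in% c("Control", "BAG3"))
sub_dropped <- droplevels(sub_raw)

fit_raw     <- lm(y ~ x + E, data = sub_raw)
fit_dropped <- lm(y ~ x + E, data = sub_dropped)

# The coefficient estimates should agree between the two fits
same_coefs <- isTRUE(all.equal(coef(fit_raw), coef(fit_dropped)))
print(same_coefs)

# The column lm() actually uses for the factor is a 0/1 indicator for BAG3
dummy <- model.matrix(fit_raw)[, "EBAG3"]
print(table(sub_dropped$E, dummy))
```

The cross-table should show every Control row with dummy = 0 and every BAG3 row with dummy = 1, so the original 0-3 codes never enter the model.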

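If you do it your way, with droplevels() applied to the subset, you get the same result as in the Control vs. Idiop fit: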

BD$Etiologia <- factor(BD$Etiologia, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"), ordered=FALSE)

BD_C_ID <- droplevels(subset(BD, Etiologia=="Control" | Etiologia=="Idiop"))

lm(y ~ x + Etiologia,data=BD_C_ID)

Call:
lm(formula = y ~ x + Etiologia, data = BD_C_ID)

Coefficients:
   (Intercept)               x  EtiologiaIdiop  
      -0.05524         0.21596         0.30433
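The same should hold for a logistic regression with glm(). A rough sketch, assuming a made-up binary outcome called event (simulated here with rbinom(), since your real outcome isn't shown):

```r
# Simulated data; `event` is a hypothetical 0/1 outcome, not from the question
set.seed(111)
BD <- data.frame(
  Etiologia = sample(0:4, 100, replace = TRUE),
  x = rnorm(100)
)
BD$event <- rbinom(100, size = 1, prob = 0.5)
BD$E <- factor(BD$Etiologia, levels = 0:4,
               labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))

sub_raw <- subset(BD, E %in% c("Control", "Idiop"))

# Fit once with the unused levels still attached, once after droplevels()
glm_raw     <- glm(event ~ x + E, family = binomial, data = sub_raw)
glm_dropped <- glm(event ~ x + E, family = binomial,
                   data = droplevels(sub_raw))

# Both fits should give the same coefficients, with "Control" as reference
coefs_match <- isTRUE(all.equal(coef(glm_raw), coef(glm_dropped)))
print(coefs_match)
print(names(coef(glm_raw)))
```

The only factor coefficient reported is EIdiop, i.e. the log-odds difference of Idiop relative to Control.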