Dealing with nested variables in R linear regression

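I have a dataset with some nested variables. For example, I have the following variables: the speed of a car (speed), whether another car is following it (other_car), and, if there is another car, the distance between the two cars (distance). Dummy dataset: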

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)

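I want to include the variables other_car and distance in the model as nested variables, i.e. if there is another car, the distance should be taken into account as well. Following the approach mentioned here: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model , I tried the following: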

dft <- data.frame(speed,other_car,distance)
dft$other_car<-factor(dft$other_car)

lm_speed <- lm(speed ~ dft$other_car + dft$other_car:dft$distance)
summary(lm_speed)

This gives the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Any ideas?

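This happens because distance is NA whenever other_car==0. lm() drops the rows with missing values, so only the rows with other_car==1 remain and the factor is left with a single level. See: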

dft$distance[dft$other_car==0]
[1] NA NA NA NA NA NA NA
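
A quick way to check this is to build the model frame roughly the way lm() would and count the factor levels that survive. This is only a sketch: passing na.action = na.omit and drop.unused.levels = TRUE to model.frame() is my assumption about what lm() does by default, and mf is just a name chosen for illustration.

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed, other_car, distance)
dft$other_car <- factor(dft$other_car)

# Build the model frame with NA rows removed and unused factor levels dropped
# (assumed to mirror the defaults lm() applies internally)
mf <- model.frame(speed ~ other_car + other_car:distance, data = dft,
                  na.action = na.omit, drop.unused.levels = TRUE)

nrow(mf)               # rows left after dropping NA distances
nlevels(mf$other_car)  # factor levels left among those rows
```

If only one level is left, there is nothing to contrast it against, which is what the error message is complaining about.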

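You can replace the NAs with a constant distance where other_car==0. The model can then use the factor level other_car==0, and the distance coefficient for that subset is simply not estimated (it is reported as NA), because distance is constant there: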

dft$distance[dft$other_car==0]<-0

dft$other_car<- factor(dft$other_car)

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.015  -8.500  -3.876   8.894  21.000 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39.0000     5.0405   7.737 8.96e-06 ***
other_car1            4.6480    13.0670   0.356    0.729    
other_car0:distance       NA         NA      NA       NA    
other_car1:distance   0.3157     0.6133   0.515    0.617    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.34 on 11 degrees of freedom
Multiple R-squared:  0.1758,    Adjusted R-squared:  0.026 
F-statistic: 1.174 on 2 and 11 DF,  p-value: 0.3452
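
To see where the NA coefficient comes from, the sketch below looks at the design matrix and compares the nested formula with a plain additive one. It assumes the NAs were replaced by 0 as above; fit_nested, fit_simple and X are names made up for this example.

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed, other_car, distance)
dft$distance[dft$other_car == 0] <- 0   # constant placeholder distance
dft$other_car <- factor(dft$other_car)

fit_nested <- lm(speed ~ other_car + other_car:distance, data = dft)
fit_simple <- lm(speed ~ other_car + distance, data = dft)

# The other_car0:distance column is all zeros, so its coefficient
# cannot be estimated and lm() reports it as NA
X <- model.matrix(fit_nested)
all(X[, "other_car0:distance"] == 0)

# With distance fixed at 0 when there is no other car,
# both formulas describe the same fitted values
all.equal(fitted(fit_nested), fitted(fit_simple))
```

So with the 0 placeholder in place, speed ~ other_car + distance should be an equivalent way to write the model without the undefined coefficient.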

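Another workaround is to convert the factor to numeric, but this isn't the same model. The rows with NA distance are still dropped, so it is fitted only on the observations where another car is present: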

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)



dft$other_car<- as.numeric(factor(dft$other_car))

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
        2         6         7         8         9        11        13 
  0.03776   3.72205  19.77341 -15.38369 -16.01511  10.61782  -2.75227 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)         43.6480    12.9010   3.383   0.0196 *
other_car                NA         NA      NA       NA  
other_car:distance   0.1579     0.3281   0.481   0.6508  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.27 on 5 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.04424,   Adjusted R-squared:  -0.1469 
F-statistic: 0.2314 on 1 and 5 DF,  p-value: 0.6508

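The positive distance coefficient suggests that speed increases with the distance to the other car (or, conversely, that drivers tend to slow down when the other car is too close), although with this small dummy dataset the effect is far from significant (p ≈ 0.6).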
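
As for why the numeric version reports a different slope, here is a small sketch. It relies on as.numeric(factor(...)) returning the level codes 1 and 2 rather than the original 0 and 1; codes, fit_num and fit_sub are illustrative names.

```r
speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed, other_car, distance)

# Level codes of the factor: 1 for "0" and 2 for "1"
codes <- as.numeric(factor(dft$other_car))
sort(unique(codes))

# Numeric version: NA rows are dropped, other_car is constant (2) in what
# remains, so other_car:distance is just 2 * distance
fit_num <- lm(speed ~ other_car + other_car:distance,
              data = transform(dft, other_car = codes))

# Plain regression on the rows where another car is present
fit_sub <- lm(speed ~ distance, data = subset(dft, other_car == 1))

# The interaction slope should be half of the plain distance slope
unname(coef(fit_num)["other_car:distance"]) * 2
unname(coef(fit_sub)["distance"])
```

In other words, the numeric workaround amounts to a regression of speed on distance for the cars that are being followed, with the slope rescaled by the level code.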