NA values when regressing with a dummy-variable interaction term

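I am trying to estimate which factors explain the difference in happiness levels between residents of New York and Chicago.

The data look like this.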

  Happiness     City Gender Employment   Worktype      Holiday
1        60 New York      0          0 Unemployed   Unemployed
2        80  Chicago      1          1 Whitecolor 1 day a week
3        39  Chicago      0          0 Unemployed   Unemployed
4        40 New York      1          0 Unemployed   Unemployed
5        69  Chicago      1          1  Bluecolor 2 day a week
6        90  Chicago      1          1  Bluecolor 2 day a week
7       100 New York      0          1 Whitecolor 2 day a week
8        30 New York      1          1 Whitecolor 1 day a week

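Happiness is the dependent variable, and 'City' is where the person lives. 'Gender' is coded 0 = male, 1 = female. 'Employment' is 0 = unemployed, 1 = employed. 'Worktype' is a three-level factor: 'Unemployed', 'Whitecolor', 'Bluecolor'. 'Holiday' is how many days off a person has per week. 'City', 'Gender', 'Worktype' and 'Holiday' are all factors; 'Happiness' and 'Employment' are numeric.

The model I want to estimate is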

lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))

I kept 'Employment' numeric so that when 'Employment' equals 0 (unemployed), 0:(Worktype + Holiday) = 0 and the model automatically reduces to

lm(Happiness ~ City + Gender)

for unemployed people.

However, the regression returns NA values.

Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                       56.75      23.56   2.408    0.138
CityNew York                     -14.50      27.21  -0.533    0.647
Gender1                           -2.25      35.99  -0.063    0.956
Employment:WorktypeBluecolor      25.00      43.02   0.581    0.620
Employment:WorktypeUnemployed        NA         NA      NA       NA
Employment:WorktypeWhitecolor     57.75      35.99   1.604    0.250
Employment:Holiday1 day a week   -50.00      54.42  -0.919    0.455
Employment:Holiday2 day a week       NA         NA      NA       NA

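This seems to be caused by the 'Unemployed' value in the 'Worktype' and 'Holiday' variables. However, I'm not sure why R doesn't treat Employment:WorktypeUnemployed, which is clearly 0:Worktype = 0, as zero and drop it from the model. Is it because R sets Employment:HolidayUnemployed as the baseline and the two are perfectly multicollinear? (I had to give 'Worktype' and 'Holiday' an 'Unemployed' value because I want to see the effect of 'Worktype' and 'Holiday' relative to unemployed people. If I remove the 'Unemployed' value the NAs disappear, but the baseline becomes 'Whitecolor' and '1 day a week', so I can't see the effect relative to 'Unemployed'.)

If so, why is the coefficient of 'Employment:Holiday2 day a week' NA? It seems unrelated to the 'Unemployed' value.

Can I rely on this result if I simply drop the NA coefficients?

Reproducible code follows.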

Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)

City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
                    "Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",
                        "Unemployed", "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")

data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)

head(data,8)
str(data)

reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)

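You don't need to worry about the NA for Employment:WorktypeUnemployed. R tries to compute every interaction term automatically, but this particular coefficient cannot be determined because Employment = 1 and Worktype = "Unemployed" never occur together, so its column in the model matrix is all zeros. It has no effect on the other coefficients; you can verify this by coding the dummy variables manually: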

> library(lme4) # for the convenient "dummy" function 
> data <- data.frame(data, 
+   dummy(Worktype, c("Bluecolor","Whitecolor")), 
+   h1=dummy(Holiday)[,1], 
+   h2=dummy(Holiday)[,2])
>   
> reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor  + Employment:h1 + Employment:h2 , data)
> summary(reg)

Call:
lm(formula = Happiness ~ City + Gender + Employment:Bluecolor + 
    Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)

Residuals:
         1          2          3          4          5          6          7          8 
 1.775e+01  1.775e+01 -1.775e+01  8.882e-16 -1.050e+01  1.050e+01  4.441e-15 -1.775e+01 

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)              56.75      23.56   2.408    0.138
CityNew York            -14.50      27.21  -0.533    0.647
Gender1                  -2.25      35.99  -0.063    0.956
Employment:Bluecolor     25.00      43.02   0.581    0.620
Employment:Whitecolor    57.75      35.99   1.604    0.250
Employment:h1           -50.00      54.42  -0.919    0.455
Employment:h2               NA         NA      NA       NA

Residual standard error: 27.21 on 2 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  -0.1208 
F-statistic: 0.8491 on 5 and 2 DF,  p-value: 0.619

Even though Employment:WorktypeUnemployed is no longer in the model, the estimated coefficients are the same.
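As a quick sanity check on why that coefficient is undefined, the sketch below rebuilds the question's data under the assumed name `dat`, fits the original formula as `reg0` (also an assumed name), and inspects the model matrix column for Employment:WorktypeUnemployed. It uses base R only.

```r
# Rebuild the question's data (same values as the reproducible example).
Happiness  <- c(60, 80, 39, 40, 69, 90, 100, 30)
City       <- factor(c("New York", "Chicago", "Chicago", "New York",
                       "Chicago", "Chicago", "New York", "New York"))
Gender     <- factor(c(0, 1, 0, 1, 1, 1, 0, 1))
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)
Worktype   <- factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                       "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))
Holiday    <- factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
dat <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)

# Fit the original model and pull out its full design matrix
# (aliased columns are still present in model.matrix()).
reg0 <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday), data = dat)
X <- model.matrix(reg0)

# Employment is 0 for every unemployed row, so this product column is all zeros.
zero_col <- X[, "Employment:WorktypeUnemployed"]
print(zero_col)
print(all(zero_col == 0))
```

A column of zeros carries no information about its coefficient, which is why lm() reports NA for it without disturbing the other estimates.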

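However, the NA for Employment:h2 (equivalent to Employment:Holiday2 day a week) remains. This is because the model matrix is singular, i.e. one column is a linear combination of the others. In this data every employed person is either Bluecolor or Whitecolor and has either 1 or 2 days off, so Employment:Bluecolor + Employment:Whitecolor and Employment:h1 + Employment:h2 both equal Employment. solve() confirms the singularity: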

> solve(crossprod(model.matrix(reg)))
Error in solve.default(crossprod(model.matrix(reg))) : 
  system is computationally singular: reciprocal condition number = 1.79897e-18
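The dependence behind that singular matrix can be checked directly. The following is a minimal base R sketch (no lme4) that builds the four interaction columns by hand; the names `blue`, `white`, `h1`, `h2` and `X` are chosen here for illustration and only mirror the columns used above.

```r
# Same data as in the question.
City       <- factor(c("New York", "Chicago", "Chicago", "New York",
                       "Chicago", "Chicago", "New York", "New York"))
Gender     <- factor(c(0, 1, 0, 1, 1, 1, 0, 1))
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)
Worktype   <- factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                       "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))
Holiday    <- factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")

# Hand-coded interaction columns: Employment times a 0/1 indicator.
blue  <- Employment * as.numeric(Worktype == "Bluecolor")
white <- Employment * as.numeric(Worktype == "Whitecolor")
h1    <- Employment * as.numeric(Holiday == "1 day a week")
h2    <- Employment * as.numeric(Holiday == "2 day a week")

# Both pairs of columns add up to Employment, so h2 = blue + white - h1.
print(all(blue + white == Employment))
print(all(h1 + h2 == Employment))
print(all(h2 == blue + white - h1))

# Design matrix of the hand-coded model: 7 columns, but rank is lower.
X <- cbind(1, as.numeric(City == "New York"), as.numeric(Gender == "1"),
           blue, white, h1, h2)
print(c(columns = ncol(X), rank = qr(X)$rank))
```

Because the last column is an exact combination of three earlier ones, lm() has to drop one of them, and it drops the last one it encounters.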

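A larger dataset will not necessarily make this go away: the same dependence recurs whenever every employed person belongs to exactly one non-Unemployed level of both Worktype and Holiday, because the Employment:Worktype dummies and the Employment:Holiday dummies then each sum to Employment. In the end, you can try to remove any redundancy from the model (for example, are there employees with 0 days off per week? If not, 1 day should be the baseline, and you can add an extra column for holidays > 1). You can use the alias() function to check which terms are causing the problem.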
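For reference, here is one way alias() might be used on the original model. This is a sketch with assumed object names (`dat`, `reg0`, `al`); it relies only on the documented `Complete` component of the object returned by alias().

```r
# Same data as in the question.
Happiness  <- c(60, 80, 39, 40, 69, 90, 100, 30)
City       <- factor(c("New York", "Chicago", "Chicago", "New York",
                       "Chicago", "Chicago", "New York", "New York"))
Gender     <- factor(c(0, 1, 0, 1, 1, 1, 0, 1))
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)
Worktype   <- factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                       "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"))
Holiday    <- factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
dat <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)

reg0 <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday), data = dat)

# alias() reports, for each coefficient lm() could not estimate, how its
# column can be written in terms of the columns that were kept.
al <- alias(reg0)
print(al$Complete)

# The rows of al$Complete are the aliased terms; they should match the
# coefficients that summary(reg0) shows as NA.
aliased <- rownames(al$Complete)
print(aliased)
print(names(coef(reg0))[is.na(coef(reg0))])
```

Each row of the `Complete` matrix gives the weights that reproduce the aliased column from the retained ones, so an all-zero row points to an empty column while nonzero entries point to a genuine linear dependence.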