R 的 lm() 函数在不报告错误或警告的情况下删除因子水平（截距估计器除外）

Question

我正在尝试使用前向逐步方差分析和 AIC 选择标准将线性模型拟合到相当大的不平衡数据集，并带有交互项。有 13,072 个观察值。这是它的设置方式：

响应变量 dayseclosion 为连续数值

解释变量都是分类的：host（4个水平），site（25个水平），year（5个水平） , monoverwinter (4 个等级).

> glimpse(dat)
Observations: 13,072
Variables: 5
$ site          <chr> "10", "10", "10", "10", "10", "14", "14", "15", "15",…
$ year          <chr> "2014", "2014", "2014", "2014", "2014", "2017", "2017…
$ host          <chr> "3", "3", "4", "3", "3", "1", "1", "1", "1", "1", "1"…
$ monoverwinter <chr> "6", "6", "6", "6", "6", "5", "5", "5", "5", "5", "5"…
$ dayseclose    <dbl> 11, 12, 17, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 2…

m0 –> lm(dayseclose ~ 1, data = dat)
model.aic.forward –> step(m0, direction = "forward", trace = 1, scope = ~ host * monoverwinter * site * year)

现在，当我说数据不平衡时，我的意思是并非所有 4host 都被收集，所有 25site 全部 5year，并且存在不平等的代表性4monoverwinter 实验处理等因素水平。但是，每个因子水平内仍有大量观察值（=数百个）。

一切似乎运行都很好——没有警告也没有错误。选择以下型号：

## Step:  AIC=61561.31
## dayseclose ~ host + site + monoverwinter + year + host:site + 
##     host:monoverwinter + site:monoverwinter + site:year
## 
##                           Df Sum of Sq     RSS   AIC
## <none>                                 1433157 61561
## + host:year                1     8.059 1433149 61563
## + host:monoverwinter:site  3   242.885 1432914 61565

问题是当我检查 summary() 和 anova() table 时，它揭示了与year 被神秘删除 (year2020)。注意，这不是用于估计截距的级别（即 year2016）。那一年有 3,581 个观测值，但在 summary(model.aic.forward) table 中，系数为（仅显示部分）：

## year2017                           6.36787    2.44775   2.602 0.009292 ** 
## year2018                          -0.13757    1.85568  -0.074 0.940906    
## year2019                         -10.56667    3.45693  -3.057 0.002243 ** 
## year2020                                NA         NA      NA       NA

此处也未显示，但与 year2020 的所有交互也具体显示为 NA。

奇怪的是，根据 F-stat 自由度，似乎所有观察结果，包括 year2020 都被用于拟合模型 (79 + 12992 = 13071)：

## Residual standard error: 10.5 on 12992 degrees of freedom
## Multiple R-squared:  0.3105, Adjusted R-squared:  0.3063 
## F-statistic: 74.06 on 79 and 12992 DF,  p-value: < 2.2e-16

最后（我知道这很长），方差分析 table 中 year 的 df 是 3，但应该是 4，给定数据中的五个因子水平：

d =anova(model.aic.forward)
as_tibble(d, rownames = "Predictors")

## # A tibble: 9 x 6
##   Predictors             Df `Sum Sq` `Mean Sq` `F value`  `Pr(>F)`
##   <chr>              <int>    <dbl>     <dbl>     <dbl>     <dbl>
## 1 host                   3  497707.   165902.   1504.    0.      
## 2 site                  24   43276.     1803.     16.3   3.77e-67
## 3 monoverwinter          3   34395.    11465.    104.    1.73e-66
## 4 year                   3    2422.      807.      7.32  6.71e- 5
## 5 host:site             16   43445.     2715.     24.6   1.08e-72
## 6 host:monoverwinter     3    7319.     2440.     22.1   2.80e-14
## 7 site:monoverwinter    12    9229.      769.      6.97  9.13e-13
## 8 site:year             15    7646.      510.      4.62  6.30e- 9
## 9 Residuals          12992 1433157.      110.     NA    NA

我是不是理解错了？ year2020 发生了什么事？数据的不平衡性是否会导致这种情况？我无法想象如何提供一个最小的可重现示例，因为可能是导致问题的数据的数量和复杂性？

感谢您阅读并可能帮助解决这个难题。

Answer 1

是的，您的缺失可能导致了您的结果。

考虑以下可重现的示例。

x1 <- sample(1:5, 1000, replace=T)  
x2 <- sample(1:3, 1000, replace=T)
y <- 2*x1 + 3*x2 + rnorm(1000)
#no missings (everything works fine)
lm(y~as.factor(x1) + as.factor(x2))
lm(y~ -1 + as.factor(x1) + as.factor(x2)) # no intercept
#with missings
x1[x2==2]<-NA #create specific missingness
table(x1,useNA = "always")
table(x2,useNA = "always")
table(x1,x2,useNA = "always") #you see the missing pattern
lm(y~ as.factor(x1) + as.factor(x2))
lm(y~ -1 + as.factor(x1) + as.factor(x2))

如果您运行代码，您会看到因子 x2 省略了两个类别。第一个是ref.cat。第二个 (x2)2 因为缺失。

R 的 lm() 函数在不报告错误或警告的情况下删除因子水平（截距估计器除外）

R's lm() function removes factor levels (aside from intercept estimator) without reporting error or warning

r

anova

lm