How does R's lm function handle factor levels (in C_Cdqrls?)?

Or, in other words: which algorithm is used in this case? I guessed they use discriminant analysis, as in e.g. Chapter 4.4 of James et al., "An Introduction to Statistical Learning with Applications in R".

Prompted by the comments, I can also restate the question as follows:

I found a very helpful explanation in this article, but at one point it only states:

... This [factor] deconstruction can be a complex task, so we will not go into details lest it take us too far afield...

I cannot find anything about it in the documentation, nor could I work out what happens using debug(lm). My understanding, based on a reproducible example:

n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
debug(lm)
fit <- lm(z ~ x*y)

After mt <- attr(mf, "terms") it looks like

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric" "numeric" 

whereas after
n <- 10
p <- 6
set.seed(1)
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- c(1:3)
y <- sample(y, n, replace = TRUE)
z <- 10*y*x + 10*y + 10 + rnorm(n, 0, 1)
y <- as.factor(y)
debug(lm)
fit <- lm(z ~ x*y)

mt <- attr(mf, "terms") looks like

mt
# ...
# attr(,"dataClasses")
#         z         x         y 
# "numeric" "numeric"  "factor"

But then it seems that lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) is always called, with z <- .Call(C_Cdqrls, x, y, tol, FALSE), which I thought only works when there are no factors. The link above explains the model matrix and the QR decomposition very well, but I thought the QR decomposition would not work in the factor case.
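
This can be checked by hand: once the factor has been expanded into dummy columns by model.matrix, an ordinary QR least-squares solve reproduces lm's coefficients. A minimal sketch, with made-up data analogous to the example above (the deterministic rep() for y is my own simplification, chosen so that every level is well represented):

```r
# Minimal sketch: solve the least-squares problem via qr() on the
# expanded model matrix -- the same problem C_Cdqrls solves in compiled code.
set.seed(1)
n <- 10
x <- seq(0, 20, length.out = n) + rnorm(n, 0, 1)
y <- factor(rep(1:3, length.out = n))   # three factor levels, all present
z <- 10 * as.numeric(y) * x + 10 * as.numeric(y) + 10 + rnorm(n, 0, 1)
X <- model.matrix(~ x * y)              # factor already expanded to dummies
beta_qr <- qr.coef(qr(X), z)            # plain QR least squares
beta_lm <- coef(lm(z ~ x * y))
all.equal(unname(beta_qr), unname(beta_lm))  # TRUE
```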

Edit: The model matrix after x <- model.matrix(mt, mf, contrasts) already differs. In the numeric case:

x
   (Intercept)          x y       x:y
1            1 -0.6264538 3 -1.879361
2            1  2.4058655 1  2.405866
3            1  3.6088158 2  7.217632
4            1  8.2619475 1  8.261947
5            1  9.2183967 1  9.218397
6            1 10.2906427 2 20.581285
7            1 13.8207624 1 13.820762
8            1 16.2938803 2 32.587761
9            1 18.3535591 3 55.060677
10           1 19.6946116 2 39.389223
attr(,"assign")
[1] 0 1 2 3

In the factor case:

x
   (Intercept)          x y2 y3      x:y2       x:y3
1            1 -0.6264538  0  1  0.000000 -0.6264538
2            1  2.4058655  0  0  0.000000  0.0000000
3            1  3.6088158  1  0  3.608816  0.0000000
4            1  8.2619475  0  0  0.000000  0.0000000
5            1  9.2183967  0  0  0.000000  0.0000000
6            1 10.2906427  1  0 10.290643  0.0000000
7            1 13.8207624  0  0  0.000000  0.0000000
8            1 16.2938803  1  0 16.293880  0.0000000
9            1 18.3535591  0  1  0.000000 18.3535591
10           1 19.6946116  1  0 19.694612  0.0000000
attr(,"assign")
[1] 0 1 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$`y`
[1] "contr.treatment"

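The contr.treatment coding reported in the "contrasts" attribute can be inspected directly. This is an illustrative aside (not part of the original debugging session):

```r
# contr.treatment drops the first level (the baseline, absorbed by the
# intercept) and gives each remaining level a 0/1 indicator column.
y <- factor(c(1, 2, 3, 2))
contr.treatment(levels(y))
#   2 3
# 1 0 0
# 2 1 0
# 3 0 1
model.matrix(~ y)   # intercept plus the indicator columns y2, y3
```
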
Edit 2: Part of the question can also be found here.

With the help of the answer to this question, I realized that the answer is simple:

If the factors belong to the (predictor) variables, model.matrix just becomes larger. So it is clear that C_Cdqrls can handle the model matrix.
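
A quick sketch of this claim, with made-up data analogous to the example above: lm.fit, the workhorse that ends up in .Call(C_Cdqrls, ...), accepts the enlarged model matrix directly and reproduces lm's coefficients.

```r
# Sketch: feed the factor-expanded model matrix straight to lm.fit().
set.seed(1)
x <- seq(0, 20, length.out = 10) + rnorm(10, 0, 1)
y <- factor(rep(1:3, length.out = 10))
z <- 10 * as.numeric(y) * x + 10 * as.numeric(y) + 10 + rnorm(10, 0, 1)
X <- model.matrix(~ x * y)    # 6 columns once y is a factor
fit <- lm.fit(X, z)           # internally: .Call(C_Cdqrls, x, y, tol, FALSE)
all.equal(fit$coefficients, coef(lm(z ~ x * y)))  # TRUE
```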

Only when the dependent variable contains factors does linear regression, i.e. lm, not work properly; then discriminant analysis is one possibility. (At first glance, stats::glm uses a logit model.)

From Wikipedia:

Discriminant function analysis is very similar to logistic regression, and both can be used to answer the same research questions. Logistic regression does not have as many assumptions and restrictions as discriminant analysis. However, when discriminant analysis’ assumptions are met, it is more powerful than logistic regression. Unlike logistic regression, discriminant analysis can be used with small sample sizes. It has been shown that when sample sizes are equal, and homogeneity of variance/covariance holds, discriminant analysis is more accurate. With all this being considered, logistic regression has become the common choice, since the assumptions of discriminant analysis are rarely met.


Example:

x <- seq(0, 10, length.out = 21)
y <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
y <- as.factor(y)
df <- data.frame(x = x, y = y)

# recover the numeric values behind the factor levels
# (see the 'Warning' section in ?factor):
plot(x, as.numeric(levels(y))[y], ylim = c(0, 1.2))

fit <- lm(y ~ x, data = df)
print(summary(fit))

fit_glm <- stats::glm(y ~ x, family = binomial(link = "logit"), data = df, control = list(maxit = 50))
print(summary(fit_glm))

df$glm.probs <- stats::predict(fit_glm, newdata = df, type = "response")
df$glm.pred <- ifelse(df$glm.probs > 0.5, 1, 0)
points(x, df$glm.pred + 0.05, col = "red")
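
For completeness, the discriminant-analysis alternative mentioned above can be sketched on the same data with MASS::lda (MASS ships with R; this is an illustration of mine, not part of the original question):

```r
library(MASS)   # ships with R; provides lda()

x <- seq(0, 10, length.out = 21)
y <- as.factor(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
df <- data.frame(x = x, y = y)

fit_lda <- lda(y ~ x, data = df)                  # linear discriminant analysis
df$lda.pred <- predict(fit_lda, newdata = df)$class
table(observed = df$y, predicted = df$lda.pred)   # confusion matrix
```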