多项式回归：为什么需要在 lm() 之外定义多项式项以避免奇点？

Question

如果您尝试运行多项式回归，其中 x^2 在 lm() 函数中定义，多项式项将由于奇点而被删除。但是，如果我们在 lm() 之外定义多项式项，则模型会正确拟合。

看起来它应该在两种方式下都起作用。 为什么我们需要在lm()函数之外定义多项式项？

x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))

# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)

# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)

Answer 1

lm 不知道术语在公式中的开始和结束位置，除非您告诉它，通常是通过将其包装在函数中。对于任意计算，您可以将它们包装在 I(...) 中，它告诉函数按原样使用它：

set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))

lm(y ~ x + I(x^2))
#> 
#> Call:
#> lm(formula = y ~ x + I(x^2))
#> 
#> Coefficients:
#> (Intercept)            x       I(x^2)  
#>   2.563e-01    2.488e+00   -3.660e-06

实际上，您可以将 x^2 包装在大多数函数调用中，这些函数调用将 return 一个可在模型矩阵中使用的评估向量。在某些情况下 cbind 可能非常方便，尽管 c、identity 甚至 {...} 都可以。不过，I 是专门建造的。

或者，您可以使用 poly 函数为您生成两项，这对于高次多项式非常有用。默认情况下，它生成正交多项式，这将使系数看起来不同：

lm(y ~ poly(x, 2))
#> 
#> Call:
#> lm(formula = y ~ poly(x, 2))
#> 
#> Coefficients:
#> (Intercept)  poly(x, 2)1  poly(x, 2)2  
#>    1.500000   243.485357    -0.004319

即使他们评价相同：

new <- data.frame(x = seq(-1, 1, .5))

predict(lm(y ~ x + I(x^2)), new)
#>          1          2          3          4          5 
#> -2.2317175 -0.9876930  0.2563297  1.5003505  2.7443695

predict(lm(y ~ poly(x, 2)), new)
#>          1          2          3          4          5 
#> -2.2317175 -0.9876930  0.2563297  1.5003505  2.7443695

如果您真的希望系数相同，请添加 raw = TRUE:

lm(y ~ poly(x, 2, raw = TRUE))
#> 
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#> 
#> Coefficients:
#>             (Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2  
#>               2.563e-01                2.488e+00               -3.660e-06

多项式回归：为什么需要在 lm() 之外定义多项式项以避免奇点？

Polynomial regression: Why do you need to define polynomial terms outside of lm() to avoid singularities?

regression

r

quadratic

linear-regression

polynomials