How do I use the means of the columns of a matrix as prediction values in a linear regression in R?

Question

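Problem statement: Some near-infrared spectra on 60 samples of gasoline and the corresponding octane numbers can be found by data(gasoline, package="pls"). Compute the mean value for each frequency and predict the response for the best model using the five different methods from Question 4.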

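Note: This is Exercise 11.5 in Linear Models with R, 2nd Ed., by Julian Faraway. Also, the "five different methods from Question 4" are: linear regression with all predictors, linear regression with variables selected using AIC, principal component regression, partial least squares, and ridge regression.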

My work so far: We do

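library(pls)  # provides the gasoline data set
data(gasoline, package="pls")
# hold out every tenth sample (rows 1, 11, ..., 51) as the test set
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
# regress octane on all 401 columns of the NIR matrix at once
lmod = lm(octane~NIR,train_gas)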

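So far, so good. However, if I look at a summary of the model, I find that 348 coefficients are not defined because of singularities. (Why?) Moreover, it turns out to be difficult to massage the column means of the NIR matrix (the predictors) into an acceptable data frame.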

My question: How can I get to the point where the highly-fussy predict function will let me do something like this:

new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)

?

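Incidentally, since I do a fair amount of moderating on Stats.SE, I can say with certainty that this question would be closed on Stats.SE as off-topic. It is a "programming or data request", and hence unwelcome on Stats.SE.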

I have also looked up a few related questions on SO, but nothing seems to fit exactly.

Answer 1

This does seem pretty CrossValidated-ish to me ... gasoline is a rather odd object, containing a 'column' (element) that is a 401-column matrix:

'data.frame':    60 obs. of  2 variables:
 $ octane: num  85.3 85.2 88.5 83.4 87.9 ...
 $ NIR   : 'AsIs' num [1:60, 1:401] -0.0502 -0.0442 -0.0469 -0.0467 -0.0509 ...

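However, the fundamental problem is that this is a p>>n problem; there are 60 observations and 401 predictors. A standard linear regression therefore probably won't make sense; you probably want to use a penalized approach such as LASSO/ridge (i.e., glmnet). This is also why you get undefined coefficients: without some kind of penalization, you can't estimate 402 coefficients (ncols + 1 for the intercept) from 60 observations. Your training set has only 54 rows, so at most 54 coefficients can be estimated, which leaves 402 - 54 = 348 undefined.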
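The counting argument can be checked without the gasoline data. The following is a minimal sketch on simulated data (the names sim, fit, n_coef and n_na are made up for illustration, and random noise stands in for the spectra), assuming the same shape as the training set: 54 rows and a 401-column matrix predictor.

```r
# Simulated stand-in for the training data: 54 rows, 401 predictors.
# This is NOT the gasoline data; it only mimics its dimensions.
set.seed(1)
n <- 54
p <- 401
sim <- data.frame(y = rnorm(n))
sim$X <- I(matrix(rnorm(n * p), nrow = n))  # matrix column, like gasoline$NIR

fit <- lm(y ~ X, data = sim)

n_coef <- length(coef(fit))      # intercept + 401 slopes
n_na   <- sum(is.na(coef(fit)))  # coefficients lm() could not estimate
c(rank = fit$rank, coefficients = n_coef, undefined = n_na)
```

Under these assumptions the fit should have rank 54, so the number of undefined coefficients should again be 402 - 54 = 348.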

However, if we do want to hack this into a shape where we can fit the linear model and predict from it (however ill-advised):

NIR <- gasoline$NIR
class(NIR) <- "matrix" ## override "AsIs" class
g2 <- data.frame(octane = gasoline$octane, NIR)
dim(g2) ## 60 402 - now this is a 'regular' data frame

## using train_index from above
train_gas <- g2[train_index,]
lmod = lm(octane~., train_gas)
## drop first column (response); use `lapply()` to maintain list structure
new_data <- as.data.frame(lapply(train_gas[-1], mean))
predict(lmod, new_data)
##        1 
## 87.16019 
## Warning message:
## In predict.lm(lmod, new_data) :
##   prediction from a rank-deficient fit may be misleading
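As an aside, a one-row data frame of column means can be built in more than one way. Here is a small sketch on made-up data (toy, nd_lapply and nd_t are illustrative names, not part of the original answer) comparing the lapply() construction with as.data.frame(t(colMeans(...))), which should produce the same column names and the same prediction.

```r
# Toy 'regular' data frame: response y plus three ordinary numeric columns.
set.seed(2)
X <- matrix(rnorm(20 * 3), nrow = 20,
            dimnames = list(NULL, c("a", "b", "c")))
toy <- data.frame(y = rnorm(20), X)
fit <- lm(y ~ ., data = toy)

# Two ways to get a one-row data frame of predictor means.
nd_lapply <- as.data.frame(lapply(toy[-1], mean))
nd_t      <- as.data.frame(t(colMeans(toy[-1])))

p1 <- predict(fit, nd_lapply)
p2 <- predict(fit, nd_t)
all.equal(unname(p1), unname(p2))
```

Either form keeps the predictor names intact, which is what predict() matches on.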

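A slightly more direct (but still ugly) approach is to fit the model to the original weird structure and construct a prediction frame that matches that weird structure. Here train_gas and lmod must be the originals from the question, built from gasoline, because the flattened g2 has no NIR column: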

pp <- data.frame(NIR=I(matrix(colMeans(train_gas$NIR), nrow = 1)))
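To show the whole round trip, here is a minimal sketch on simulated data (toy, fit and pred are illustrative names; the 20 x 3 matrix is an assumption chosen so that the fit is full rank). The model is fitted with a matrix column, and the prediction frame is built with the same I(matrix(...)) trick.

```r
# Toy data frame with a matrix column X, mimicking the structure of gasoline.
set.seed(3)
toy <- data.frame(y = rnorm(20))
toy$X <- I(matrix(rnorm(20 * 3), nrow = 20))
fit <- lm(y ~ X, data = toy)

# One-row prediction frame whose X is a 1 x 3 matrix of column means.
pp <- data.frame(X = I(matrix(colMeans(toy$X), nrow = 1)))
pred <- predict(fit, pp)

# Compare the prediction at the predictor means with the mean response.
c(prediction = unname(pred), mean_y = mean(toy$y))
```

The two numbers should agree: a least-squares fit with an intercept always predicts the mean response at the predictor means, so this exercise amounts to recovering the mean of the training response.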

If you're willing to forgo predict(), you could do this:

sum(na.omit(coef(lmod) * c(1, colMeans(train_gas$NIR))))
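As a sanity check, the manual sum and predict() should agree even when the fit is rank-deficient. This is a sketch on simulated data (toy, x_bar, manual and via_predict are illustrative names; the 10 x 30 shape is an assumption chosen to force p > n):

```r
# Rank-deficient toy problem: 10 observations, 30 predictors.
set.seed(4)
n <- 10
p <- 30
toy <- data.frame(y = rnorm(n))
toy$X <- I(matrix(rnorm(n * p), nrow = n))
fit <- lm(y ~ X, data = toy)

# Manual prediction: intercept term plus column means, skipping NA coefficients.
x_bar  <- c(1, colMeans(toy$X))
manual <- sum(na.omit(coef(fit) * x_bar))

# Same prediction through predict(), silencing any rank-deficiency warning.
pp <- data.frame(X = I(matrix(colMeans(toy$X), nrow = 1)))
via_predict <- suppressWarnings(predict(fit, pp))

all.equal(manual, unname(via_predict))
```

Dropping the NA terms mirrors what predict() does internally, since it uses only the coefficients that could be estimated.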

Tags: r, matrix, prediction, linear-regression