Prediction on new data with GLMNET and CARET - The number of variables in newx must be X

Question

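I have a dataset on which I am running k-fold cross-validation.

In each fold, I split the data into a training set and a test set.

To train on the dataset X, I run the following code: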

cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)

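I checked the class of 'cv_glmnet', and it returns 'train'.

I then want to use this model to predict values for the test dataset, which is a matrix with the same number of variables (columns):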

# predicting on test data 
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])

However, I keep running into the following error:

Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") : 
  The number of variables in newx must be 210

In the caret.predict documentation, I noticed that it states the following:

newdata an optional set of data to predict on. If NULL, then the original training data are used but, if the train model used a recipe, an error will occur.

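I am confused about why I am running into this error. Is it related to how I am defining the new data? My data has the right number of variables/columns (the same as the training dataset), so I don't know what is causing the error.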

Answer 1

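You get the error because your column names change when you pass as.data.frame(X). If your matrix doesn't have column names, as.data.frame() creates them, and the model expects these names when it tries to predict. If it does have column names, some of them could be changed: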

library(caret)
library(tibble)

X =  matrix(runif(50*20),ncol=20)
y = rnorm(50)

cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)

yhat <- predict.train(cv_glmnet, newdata = X) 

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) : 
  The number of variables in newx must be 20
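
The naming behaviour described above can be checked in base R without fitting any model. This is only a small sketch on a toy 2 x 3 matrix (the object name df_names is made up for illustration); it shows that an unnamed matrix gets default names once converted to a data frame, which, going by the explanation above, are the names the fitted model later looks for:

```r
# A matrix created this way has no column names.
X <- matrix(runif(6), ncol = 3)
colnames(X)

# as.data.frame() fills in default names V1, V2, ...
df_names <- colnames(as.data.frame(X))
df_names
```

So the model is trained on columns named V1, V2, ..., while the raw matrix passed to predict.train() carries no names at all.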

If the matrix has column names, it works:

colnames(X) = paste0("column",1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
                       method = "glmnet",
                       preProcess = NULL,
                       trControl = trainControl(method = "cv", number = 10),
                       tuneLength = 10)

yhat <- predict.train(cv_glmnet, newdata = X)
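
If you would rather not rename the matrix, another option that should work is to convert the test rows the same way the training rows were converted, so both sides end up with the same generated names. The following is a hedged sketch rather than a drop-in fix: train_idx and test_idx are hypothetical stand-ins for curtrainfoldi and curtestfoldi, the data are simulated, and tuneLength is reduced only to keep the run short.

```r
library(caret)

set.seed(1)
X <- matrix(runif(50 * 20), ncol = 20)  # simulated predictors, no column names
y <- rnorm(50)                          # simulated response

# Hypothetical fold indices standing in for curtrainfoldi / curtestfoldi.
train_idx <- 1:40
test_idx <- 41:50

fit <- train(x = as.data.frame(X[train_idx, ]), y = y[train_idx],
             method = "glmnet",
             trControl = trainControl(method = "cv", number = 10),
             tuneLength = 3)

# Convert the test rows with as.data.frame() as well, so the columns are
# named V1..V20 just like the training data. drop = FALSE keeps a
# single-row test fold as a matrix instead of collapsing it to a vector.
yhat <- predict(fit, newdata = as.data.frame(X[test_idx, , drop = FALSE]))
```

Because the response here is pure noise, caret may warn about missing values in the resampled performance measures; that warning is unrelated to the newx error.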
