Lasso feature selection in R

Question

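I want to run a lasso regression, using logistic regression because my outcome is categorical, to select the important variables in my data set "data". I then want to take those selected "variables", test them on the validation set x.test, and compare the predicted values with the actual ones. But I get this error: Error in cbind2(1, newx) %*% nbeta : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90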

library(glmnet)
library(caret)
# Recode the class label as binary: 0 = no event (setosa), 1 = anomalous (any other species)
iris$Species<-ifelse(iris$Species=="setosa",0,1)
#data$Cardio1M=factor(data$Cardio1M)
#split data into train and test
trainIndex <- createDataPartition(iris$Species, p=0.7, list=FALSE)
data_train <- iris[ trainIndex,]
data_test <- iris[-trainIndex,]
x.train <- data.matrix (data_train [ ,1:ncol(data_train)-1])
y.train <- data.matrix (data_train$Species)
x.test <- data.matrix (data_test [,1:(ncol(data_test))-1])
y.test <- data.matrix(data_test$Species)
# Fit a generalized linear model: alpha=0 gives ridge regression, alpha=1 gives the lasso.
# glmnet fits the model over a grid of lambda values (lambda is the shrinkage coefficient).
# Each value of lambda has its own vector of regression coefficients. For example, the 100th value of lambda, a very small one, gives a fit close to the unpenalized one.
Lasso.mod <- glmnet(x.train, y.train, alpha=1, nlambda=100, lambda.min.ratio=0.0001,family="binomial")
# Use 10-fold cross-validation to choose the optimal lambda.
set.seed(1)
#cv.out <- cv.glmnet(x, y, alpha=1,family="binomial", nlambda=100, lambda.min.ratio=0.0001,type.measure = "class")
cv.out <- cv.glmnet(x.train, y.train, alpha=1,family="binomial", nlambda=100, type.measure = "class")
# Plot the misclassification error against the different values of lambda
plot(cv.out)
best.lambda <- cv.out$lambda.min
best.lambda
co<-coef(cv.out, s = "lambda.min")
#Once we have the best lambda, we can use predict to obtain the coefficients.
p<-predict(Lasso.mod, s=best.lambda, type="coefficients")[1:6, ]
p

I want to test whether the selected features help reduce the error on my test set, but I get the error even with the iris data set.

#Selection of the significant features(predictors)
inds<-which(co!=0)
variables<-row.names(co)[inds]
variables<-variables[!(variables %in% '(Intercept)')];
#predict output values based on selected predictors
p <- predict(cv.out, s=best.lambda, newx=x.test,type="class")
# Calculate accuracy
Accuracy<- mean(p==y.test)

Answer 1

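I tried to write a comment explaining what went wrong, but it got too long, so I have to post an answer instead. Also, I believe the following is the cause of your error, but without a reproducible example I cannot guarantee that there are no other problems.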

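The main problem is that you are using x.test[, variables] instead of x.test. The object cv.out contains the names of all the variables, including those whose coefficients were shrunk to 0, so the predict command does not know where to find them once you have subset x.test to contain only the variables with nonzero coefficients.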
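To see why the subsetted matrix triggers a dimension error, here is a minimal base-R sketch of the multiplication named in the error message (cbind2(1, newx) %*% nbeta). The coefficient values and matrix sizes are made up for illustration, and plain dense matrices stand in for glmnet's sparse ones, so the error text will differ from the Cholmod message.

```r
# Hypothetical toy numbers: a model fitted on 4 predictors has 5 coefficients
# (intercept + 4 slopes). Predictors 2 and 4 were shrunk to exactly 0.
nbeta <- matrix(c(-1.0, 0.5, 0.0, 2.0, 0.0), ncol = 1)

set.seed(1)
newx_full <- matrix(rnorm(12), nrow = 3, ncol = 4)  # all 4 training columns
newx_sub  <- newx_full[, c(1, 3), drop = FALSE]     # only the "selected" columns

# Works: (3 x 5) %*% (5 x 1)
scores <- cbind(1, newx_full) %*% nbeta

# Fails: (3 x 3) %*% (5 x 1) is not conformable; capture the error message
res <- tryCatch(cbind(1, newx_sub) %*% nbeta,
                error = function(e) conditionMessage(e))

# The zero coefficients contribute nothing, so the full matrix already gives
# the same scores as using only the selected columns with their coefficients.
same <- all.equal(scores, newx_sub %*% nbeta[c(2, 4), , drop = FALSE] + nbeta[1, 1])
```

The point is that nothing is lost by passing the full x.test: the zeroed coefficients simply drop those columns out of the prediction.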

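Even then, it would still not work. The reason is that you obtained the nonzero coefficients with s = "lambda.min", but then tried to predict with s=cv.out$lambda.1se. The set of nonzero coefficients depends on lambda, so a variable, say X2, that is zeroed out in the lambda.min model may still have a nonzero coefficient in the lambda.1se model. When the predict command then looks for it in x.test, it cannot find it, because it is not in variables.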
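A quick way to check this on your own fit is to list the nonzero coefficients at both values of lambda. The sketch below assumes the iris recoding from the question; selected_at is a hypothetical helper name, not part of glmnet, and the sets you get depend on the random folds.

```r
library(glmnet)

set.seed(1)
x <- data.matrix(iris[, 1:4])
y <- ifelse(iris$Species == "setosa", 0, 1)

cv_fit <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "class")

# Hypothetical helper: names of the predictors with nonzero coefficients at s
selected_at <- function(fit, s) {
  co <- as.matrix(coef(fit, s = s))
  setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")
}

vars_min <- selected_at(cv_fit, "lambda.min")
vars_1se <- selected_at(cv_fit, "lambda.1se")

# Whichever lambda is used, predict() gets the full matrix with all columns
pred <- predict(cv_fit, newx = x, s = "lambda.1se", type = "class")
```

Comparing vars_min and vars_1se shows whether the two models agree on the selected variables. Either way, use the same s for selecting and for predicting.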

So in the end, what you should do is:

p <- predict(Lasso.mod, s=best.lambda, newx=x.test, type="class")

There are other problems with your code, but I do not think they cause the error message. Hope this helps!
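If the goal is to check whether the selected variables alone do as well on the test set, one possible approach is to refit a model on just those columns and subset the training and test matrices in the same way. This is only a sketch under assumptions that are not in the question: the data are simulated, and the unpenalized glm refit is my choice of technique, not something the original code does.

```r
library(glmnet)

# Simulated data (an assumption for illustration): only V1 and V2 carry signal
set.seed(42)
n <- 300; p <- 8
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))
train <- 1:200
test  <- 201:300

cv_fit <- cv.glmnet(x[train, ], y[train], alpha = 1, family = "binomial")
co <- as.matrix(coef(cv_fit, s = "lambda.min"))
variables <- setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")

# Lasso prediction: full test matrix, same lambda as used for the selection
pred_lasso <- predict(cv_fit, newx = x[test, ], s = "lambda.min", type = "class")
acc_lasso  <- mean(pred_lasso == y[test])

# Refit on the selected columns only; train and test are subset identically
train_df <- data.frame(y = y[train], x[train, variables, drop = FALSE])
test_df  <- data.frame(x[test, variables, drop = FALSE])
refit <- glm(y ~ ., data = train_df, family = binomial)
prob  <- predict(refit, newdata = test_df, type = "response")
acc_refit <- mean(as.integer(prob > 0.5) == y[test])
```

Comparing acc_lasso with acc_refit then gives a rough idea of whether the reduced model holds up. On perfectly separable data such as the recoded iris example, glm will warn about fitted probabilities of 0 or 1.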

Major update

You should also fix the following:

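When creating x.test and x.train, change length to ncol. In both cases you need the form data_test [,1:(ncol(data_test))-1] (with data_train for x.train). Although length and ncol give you the same number in this case, they would not if the object were a matrix rather than a data.frame. You also need the -1 part, otherwise you will end up with y inside x.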

When creating p, change type="response" to type="class", otherwise the resulting Accuracy will be 0. (I have already changed this in the code above.)
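Both points can be checked with a few lines of base R. The small data frame and the probabilities below are made up for illustration.

```r
# Hypothetical data: two predictors and the outcome y in the last column
df <- data.frame(a = 1:3, b = 4:6, y = c(0, 1, 1))
n  <- ncol(df)

# `:` binds tighter than `-`, so 1:n-1 means (1:n) - 1, i.e. 0, 1, 2.
# Index 0 is silently dropped, which is why it still selects columns 1..n-1.
idx_shifted  <- 1:n - 1
idx_explicit <- 1:(n - 1)   # 1, 2: says directly what is meant
same_cols <- identical(names(df[, idx_shifted]), names(df[, idx_explicit]))

# length() and ncol() agree for a data.frame but not for a matrix
m <- data.matrix(df)
len_df <- length(df)   # number of columns
len_m  <- length(m)    # number of cells

# type = "response" returns probabilities, which never equal the 0/1 labels
y_true <- c(0, 1, 1)
prob   <- c(0.12, 0.91, 0.67)        # made-up predicted probabilities
acc_response <- mean(prob == y_true)
acc_class    <- mean(ifelse(prob > 0.5, 1, 0) == y_true)
```

Writing the index as 1:(ncol(data_test) - 1) avoids relying on the dropped 0 and makes the intent clearer.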
