R feature selection with LASSO

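I have a small dataset (37 observations x 23 features) and want to use LASSO regression for feature selection to reduce its dimensionality. To do this, I put together the following code based on an online tutorial: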

#Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
library(dplyr)  # provides %>%, select() and all_of()

#Initialize cross validation and train LASSO
cv_5 <- trainControl(method="cv", number=5)
lasso <- train(ColumnY ~ ., data=My_Data_Frame, method='lasso', trControl=cv_5)

#Filter out the variables whose coefficients have been shrunk to 0
drop <- predict.enet(lasso$finalModel, type='coefficients', s=lasso$bestTune$fraction, mode='fraction')$coefficients
drop <- drop[drop==0] %>% names()
My_Data_Frame <- My_Data_Frame %>% select(-all_of(drop))

Most of the time the code runs without problems, but occasionally it produces the following warnings:

Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
 
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

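I suspect this happens because my data has few rows and some of the variables have low variance. Is there a way to bypass or fix this issue (e.g. by setting a parameter in the workflow)?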

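You have few observations, so in some of the training sets it is quite likely that some of your columns are all zeros or have very low variance. For example: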

library(caret)
set.seed(222)
df = data.frame(ColumnY = rnorm(37),matrix(rbinom(37*23,1,p=0.15),ncol=23))

cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso',  trControl=cv_5)

Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) : 
  Some of the columns of x have zero variance

Before running the code below, check that none of your categorical columns has only a single positive label.
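A minimal sketch of such a check, assuming the simulated `df` from above (the names `positives`, `risky` and `nzv_metrics` are only illustrative). It counts the positive labels in each binary predictor and uses caret's `nearZeroVar` to flag columns that are constant or nearly so:

```r
library(caret)

# Recreate the simulated data from above: one numeric outcome, 23 binary predictors
set.seed(222)
df <- data.frame(ColumnY = rnorm(37), matrix(rbinom(37 * 23, 1, p = 0.15), ncol = 23))

# Number of positive labels (1s) per predictor column
predictors <- df[, setdiff(names(df), "ColumnY")]
positives <- colSums(predictors)

# Columns with 0 or 1 positives are the most likely to become constant in a training fold
risky <- names(positives)[positives <= 1]
print(risky)

# caret's own diagnostic: zeroVar / nzv flags per column
nzv_metrics <- nearZeroVar(predictors, saveMetrics = TRUE)
print(nzv_metrics[nzv_metrics$zeroVar | nzv_metrics$nzv, ])
```

A column with a single positive becomes all zeros in every training set that holds that row out, which is what triggers the zero-variance error.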

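One option is to increase the number of CV folds. With 5 folds, each model is trained on 80% of the data; try 10 folds so that 90% of the data is used for training: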

cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso',  trControl=cv_10)
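To see how the number of folds changes the training sets, the following sketch (assuming the simulated `df` from above; `count_constant` is a hypothetical helper, not part of caret) uses caret's `createFolds` to build the training sets and counts how many predictors are constant in each one:

```r
library(caret)

set.seed(222)
df <- data.frame(ColumnY = rnorm(37), matrix(rbinom(37 * 23, 1, p = 0.15), ncol = 23))
predictors <- df[, setdiff(names(df), "ColumnY")]

# Hypothetical helper: for k-fold CV, report the size of each training set
# and how many predictors take a single unique value in it
count_constant <- function(k) {
  train_idx <- createFolds(df$ColumnY, k = k, returnTrain = TRUE)
  data.frame(
    fold = names(train_idx),
    n_train = vapply(train_idx, length, integer(1)),
    n_constant = vapply(train_idx, function(idx) {
      sum(vapply(predictors[idx, , drop = FALSE],
                 function(col) length(unique(col)) == 1, logical(1)))
    }, integer(1))
  )
}

print(count_constant(5))
print(count_constant(10))
```

The folds drawn here are not the ones `train` draws internally, so the counts are only indicative of how often a constant column can appear.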

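As you may have noticed, with a dataset this small, k-fold cross-validation may not offer you much of an advantage, so you can also use leave-one-out cross-validation: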

tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso',  trControl=tr)
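If you would rather keep k-fold CV, one further option, offered here as an assumption rather than something tested on your data, is to let caret drop constant predictors inside each resample through the `preProcess` argument of `train` (the `"zv"` method removes zero-variance columns). A sketch on the simulated `df` from above, where `lasso_zv` is just an illustrative name:

```r
library(caret)

set.seed(222)
df <- data.frame(ColumnY = rnorm(37), matrix(rbinom(37 * 23, 1, p = 0.15), ncol = 23))

cv_5 <- trainControl(method = "cv", number = 5)

# "zv" is applied to each training set, so predictors that are
# constant within a fold are removed before the lasso is fit on it
lasso_zv <- train(ColumnY ~ ., data = df, method = "lasso",
                  preProcess = "zv", trControl = cv_5)
print(lasso_zv$bestTune)
```

This only handles exactly constant columns; nearly constant ones can still cause numerical trouble, so some warnings may remain.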

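You can use the FSinR package for feature selection. It is an R package available from CRAN. It offers a wide variety of filter and wrapper methods that you can combine with search methods. The interface for generating wrapper evaluators follows the caret interface. For example: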

# Load the library
library(FSinR)

# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')

# Choose one of the filter/wrapper evaluators. The resampling and fitting parameters
# are those of caret's trainControl and train; you can drop them to keep things simpler.
# Note: metric = "Accuracy" assumes ColumnY is a factor (classification).
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)

# Run the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)