R feature selection with LASSO
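I have a small dataset (37 observations x 23 features) and want to perform feature selection with LASSO regression to reduce its dimensionality. To do that, I put together the following code based on an online tutorial: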
# Load the libraries
library(mlbench)
library(elasticnet)
library(caret)
library(dplyr)  # provides %>%, select() and all_of()
# Initialize cross-validation and train LASSO
cv_5 <- trainControl(method = "cv", number = 5)
lasso <- train(ColumnY ~ ., data = My_Data_Frame, method = "lasso", trControl = cv_5)
# Find the variables whose coefficients have been shrunk to 0 and drop them
drop <- predict(lasso$finalModel, type = "coefficients", s = lasso$bestTune$fraction, mode = "fraction")$coefficients
drop <- names(drop[drop == 0])
My_Data_Frame <- My_Data_Frame %>% select(-all_of(drop))
Most of the time the code runs without errors, but occasionally it produces the following:
Warning messages:
1: model fit failed for Fold2: fraction=0.9 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
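I suspect this happens because my data has few rows and some of the variables have low variance.
Is there a way to work around or fix this (for example, by setting a parameter somewhere in the process)?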
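You have few observations, so in some of the training sets it is quite likely that some of your columns are all zeros or have very low variance. For example: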
library(caret)
set.seed(222)
df <- data.frame(ColumnY = rnorm(37), matrix(rbinom(37 * 23, 1, prob = 0.15), ncol = 23))
cv_5 <- trainControl(method="cv", number=5)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_5)
Warning messages:
1: model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) :
Some of the columns of x have zero variance
Before running the code that follows, check that none of the categorical columns has only a single positive label.
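A minimal base R sketch of that check, assuming the predictors are 0/1 columns as in the simulated `df` above. The names `predictors`, `positives` and `at_risk` are illustrative only, and the cutoff of one positive is an assumption you may want to raise:

```r
# Rebuild the simulated data from the example above
set.seed(222)
df <- data.frame(ColumnY = rnorm(37),
                 matrix(rbinom(37 * 23, 1, prob = 0.15), ncol = 23))

# Keep only the predictor columns
predictors <- df[, setdiff(names(df), "ColumnY")]

# Count the positive labels (1s) in each predictor column
positives <- colSums(predictors == 1)

# Columns with at most one positive label are constant in some or all training sets
at_risk <- names(positives)[positives <= 1]

print(sort(positives))
print(at_risk)
```

Any column listed in `at_risk` becomes all zeros in every training set that leaves out its single positive row, so it is a candidate for removal before fitting.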
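One option is to increase the number of CV folds. With 5 folds, each model is trained on 80% of the data; try 10 folds so that each model is trained on 90%: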
cv_10 <- trainControl(method="cv", number=10)
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=cv_10)
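To get a feel for how the number of folds affects the problem, the sketch below counts how many training sets contain at least one constant column. It is only an illustration: it assigns folds by simple random shuffling in base R, which is an assumption and not necessarily how caret builds its folds, and `count_bad_folds` is a hypothetical helper name.

```r
# Rebuild the simulated data from the example above
set.seed(222)
df <- data.frame(ColumnY = rnorm(37),
                 matrix(rbinom(37 * 23, 1, prob = 0.15), ncol = 23))
x <- as.matrix(df[, setdiff(names(df), "ColumnY")])

# Count the folds whose training set has at least one zero-variance column
count_bad_folds <- function(x, k) {
  # Randomly assign each row to one of k folds of near-equal size
  fold_id <- sample(rep(seq_len(k), length.out = nrow(x)))
  bad <- vapply(seq_len(k), function(i) {
    train_x <- x[fold_id != i, , drop = FALSE]  # rows not held out in fold i
    any(apply(train_x, 2, var) == 0)
  }, logical(1))
  sum(bad)
}

bad_5  <- count_bad_folds(x, 5)
bad_10 <- count_bad_folds(x, 10)
print(c(five_fold = bad_5, ten_fold = bad_10))
```

The counts change with the seed and the fold assignment, so treat them as a rough diagnostic for your own data, not as a guarantee that more folds will avoid the error.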
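As you may have seen, because the dataset is so small, k-fold cross-validation might not give you much of an advantage. You can also use leave-one-out cross-validation: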
tr <- trainControl(method="LOOCV")
lasso <- train( ColumnY ~., data=df, method='lasso', trControl=tr)
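Since the question asks about a parameter, one more thing that may be worth trying is caret's `"zv"` pre-processing step, which drops zero-variance predictors inside each resample. This is a sketch under the assumption that constant columns are what triggers the failure; it may not help if the error comes from another degeneracy, such as collinear columns. The name `lasso_zv` is illustrative:

```r
library(caret)  # method = "lasso" also needs the elasticnet package installed

# Rebuild the simulated data from the example above
set.seed(222)
df <- data.frame(ColumnY = rnorm(37),
                 matrix(rbinom(37 * 23, 1, prob = 0.15), ncol = 23))

cv_5 <- trainControl(method = "cv", number = 5)

# preProcess = "zv" removes predictors that are constant in the training
# portion of each resample before the model is fitted
lasso_zv <- train(ColumnY ~ ., data = df, method = "lasso",
                  trControl = cv_5, preProcess = "zv")

print(lasso_zv$bestTune)
```

Note that a column dropped by this step never receives a coefficient, so it has to be treated as removed when you read off the selected features.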
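You can use the FSinR package for feature selection. It is an R package available on CRAN. It offers a wide variety of filter and wrapper methods that you can combine with search methods. The interface for generating wrapper evaluators follows the caret interface. For example: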
# Load the library
library(FSinR)
# Choose one of the search methods
searcher <- searchAlgorithm('sequentialForwardSelection')
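# Choose one of the filter/wrapper evaluators
# (The resampling and fitting parameters are the arguments of caret's trainControl and train; you can remove them to keep things simpler)
# (metric = "Accuracy" assumes ColumnY is a factor; for a numeric ColumnY use a regression metric such as "RMSE")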
resamplingParams <- list(method = "cv", number = 5)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))
evaluator <- wrapperEvaluator('knn', resamplingParams, fittingParams)
# Run the feature selection (returns the best features)
results <- featureSelection(My_Data_Frame, 'ColumnY', searcher, evaluator)