glmnet caret in R - How to check binary logistic LASSO model performance without error?

I am trying to use R's caret and glmnet packages to run a LASSO regression and determine the best predictors of a binary outcome of interest.

I am stuck at checking the performance of the trained model (extracting the root mean squared error and R-squared values from the predictions), because I get the following error:

Error in cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) : 'x' must be numeric

Can anyone help me figure out why my code throws this error, and how I can successfully extract the RMSE and R^2 values?

The sample code below throws the same error. I have included all of my steps so you can follow my reasoning through the LASSO regression. If you want to skip ahead, the last chunk is where the problem occurs.


set.seed(12345)

# Create toy data frame
toydata = data.frame(status = factor(sample(c('pos', 'neg'), 100, replace=TRUE)),
                x1 = runif(100, 1, 15),
                x2 = runif(100, 1, 15),
                x3 = runif(100, 1, 15),
                x4 = runif(100, 1, 15),
                x5 = runif(100, 1, 15),
                x6 = runif(100, 1, 15),
                x7 = runif(100, 1, 15),
                x8 = runif(100, 1, 15),
                x9 = runif(100, 1, 15),
                x10 = runif(100, 1, 15),
                x11 = runif(100, 1, 15),
                x12 = runif(100, 1, 15),
                x13 = runif(100, 1, 15),
                x14 = runif(100, 1, 15))



### Partition the data
library(caret)

set.seed(12345)

# Partition (split) and create index matrix of selected values
index <- createDataPartition(toydata$status,
                             p = .8, # 80% of cases assigned to object
                             list = FALSE, # Want a matrix, not a list
                             times = 1) # Only split data once


# Create training and test data frames
train <- toydata[index,] # Select rows of toydata listed in the index object
test <- toydata[-index,] # Retain only cases NOT in index matrix




# Specify k-fold cross-validation as a training method (framework)
ctrlspecs <- trainControl(method = "cv", # Cross-validation
                          number = 2, # Specify number of folds
                          savePredictions = "all") # Save all predictions


### Specify & Train LASSO Regression Model

# Create a vector of potential lambda values
  # Range provided here is kind of overkill, but good for refinement.
lambda_vector <- 10^seq(5,-5, length=500)


set.seed(12345)

# Specify LASSO regression model to be estimated using the training data and 2-fold cross-validation framework/process

model_LASSO <- train(status ~ ., # . means "all other variables as predictors"
                     data = train,
                     preProcess = c("center","scale"), # Grand mean center and standardize variables
                     method = "glmnet", # Method for LASSO regression
                     tuneGrid = expand.grid(alpha = 1, # Mixing percentage. Constant
                                            lambda = lambda_vector), # DF for model to test tuning parameters
                     trControl = ctrlspecs, # Train LASSO using k-fold cross-validation
                     na.action = na.omit, # If NAs encountered, use listwise deletion.
                     family = "binomial")
                     

# Best (optimal) tuning parameters (alpha, lambda)
  # Optimal lambda = 0.03524473
    # Best tuning parameter to minimize the root mean squared error (RMSE) of model
model_LASSO$bestTune
model_LASSO$bestTune$lambda # Directly access best lambda

# LASSO regression model coefficients (parameter estimates)
coef(model_LASSO$finalModel, # Select the final model coefficients
     model_LASSO$bestTune$lambda) # at the best lambda value

# Plot log(lambda) & RMSE
plot(log(model_LASSO$results$lambda),
     model_LASSO$results$RMSE,
     xlab = "log(lambda)",
     ylab = "RMSE")

# Variable importance
varImp(model_LASSO)

# Data visualization of variable importance
# install.packages("ggplot2")
library(ggplot2)
ggplot(varImp(model_LASSO))

### Model prediction
  # Goal: See how well our model predicts when we give it new data
predictions_LASSO <- predict(model_LASSO, # Use trained model
                             newdata = test) # To predict outcome with test data

# Model performance/accuracy
model_LASSO_perf <- data.frame(RMSE = RMSE(predictions_LASSO, 
                                           test$status),
                               Rsquared = R2(predictions_LASSO,
                                             test$status))
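
Judging by the error message, the failure seems to happen inside cor(): the call shown in the error matches caret's R2(), which computes cor(obs, pred, ...), and both predictions_LASSO and test$status are factors. A stripped-down reproduction of just that failing piece, with made-up values purely to show where the error comes from:

# cor() refuses factor input, which appears to be the root of the error above
cor(factor(c("pos", "neg", "pos")), c(0.2, 0.8, 0.6))
# Error in cor(...) : 'x' must be numeric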

This happens simply because RMSE and R-squared do not make sense for a factor outcome. You have to use caret::confusionMatrix, or convert the factors to integers (which I do not think is a great option):

confusionMatrix(predictions_LASSO, test$status)

model_LASSO_perf <- data.frame(RMSE = RMSE(as.integer(predictions_LASSO), 
                                           as.integer(test$status)),
                               Rsquared = R2(as.integer(predictions_LASSO),
                                             as.integer(test$status)))
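
If you want single-number summaries instead of (or in addition to) the full confusion matrix, here is a minimal sketch using caret's classification helpers. It assumes the "neg"/"pos" factor levels from the toy data above, and twoClassSummary() needs the pROC package installed to compute the AUC:

# Accuracy and Kappa from the hard class predictions
postResample(pred = predictions_LASSO, obs = test$status)

# ROC AUC, sensitivity, and specificity need class probabilities as well
prob_LASSO <- predict(model_LASSO, newdata = test, type = "prob")

eval_df <- data.frame(obs = test$status,        # observed classes
                      pred = predictions_LASSO, # predicted classes
                      prob_LASSO)               # probability columns "neg" and "pos"

# Sens and the ROC treat the first factor level ("neg" here) as the event class
twoClassSummary(eval_df, lev = levels(test$status))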