glmnet with caret in R: how to check binary logistic LASSO model performance without errors?
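I am trying to use R's caret and glmnet packages to run a LASSO and identify the best predictors of a binary outcome of interest.
I have been checking the performance of the trained model (extracting the root mean squared error and R-squared values from the predictions), but I get the following error:
Error in cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) : 'x' must be numeric
Can anyone help me figure out why my code throws this error? How can I successfully extract the RMSE and R^2 values?
The example code below throws the same error. I have included all of my steps so you can follow my thinking through the LASSO regression. If you want to skip ahead, the last chunk is where the problem occurs.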
set.seed(12345)
# Create toy data frame
toydata = data.frame(status = factor(sample(c('pos', 'neg'), 100, replace=TRUE)),
x1=runif(100, 1, 15),
x2=runif(100, 1, 15),
x3 = runif(100, 1, 15),
x4 = runif(100, 1, 15),
x5 = runif(100, 1, 15),
x6 = runif(100, 1, 15),
x7 = runif(100, 1, 15),
x8 = runif(100, 1, 15),
x9 = runif(100, 1, 15),
x10 = runif(100, 1, 15),
x11 = runif(100, 1, 15),
x12 = runif(100, 1, 15),
x13 = runif(100, 1, 15),
x14 = runif(100, 1, 15))
### Partition the data
library(caret)
set.seed(12345)
# Partition (split) and create index matrix of selected values
index <- createDataPartition(toydata$status,
p = .8, # 80% of cases assigned to object
list = FALSE, # Want a matrix, not a list
times = 1) # Only split data once
# Create training and test data frames
train <- toydata[index,] # Select the rows of toydata listed in the index object
test <- toydata[-index,] # Retain only cases NOT in index matrix
# Specify k-fold cross-validation as a training method (framework)
ctrlspecs <- trainControl(method = "cv", # Cross-validation
number = 2, # Specify number of folds
savePredictions = "all") # Save all predictions
### Specify & Train LASSO Regression Model
# Create a vector of potential lambda values
# Range provided here is kind of overkill, but good for refinement.
lambda_vector <- 10^seq(5,-5, length=500)
set.seed(12345)
# Specify LASSO regression model to be estimated using the training data and 2-fold cross-validation framework/process
model_LASSO <- train(status ~ ., # . means "all other variables"
data = train,
preProcess = c("center","scale"), # Grand mean center and standardize variables
method = "glmnet", # Method for LASSO regression
tuneGrid = expand.grid(alpha = 1, # Mixing percentage. Constant
lambda = lambda_vector), # DF for model to test tuning parameters
trControl = ctrlspecs, # Train LASSO using k-fold cross-validation
na.action = na.omit, # If NAs encountered, use listwise deletion.
family = "binomial")
# Best (optimal) tuning parameters (alpha, lambda)
# Optimal lambda = 0.03524473
# Best tuning parameter to minimize the root mean squared error (RMSE) of model
model_LASSO$bestTune
model_LASSO$bestTune$lambda # Directly access best lambda
# LASSO regression model coefficients (parameter estimates)
coef(model_LASSO$finalModel, # Select the final model coefficients
model_LASSO$bestTune$lambda) # at the best lambda value
# Plot log(lambda) & RMSE
plot(log(model_LASSO$results$lambda),
model_LASSO$results$RMSE,
xlab = "log(lambda)",
ylab = "RMSE")
# Variable importance
varImp(model_LASSO)
# Data visualization of variable importance
# install.packages("ggplot2")
library(ggplot2)
ggplot(varImp(model_LASSO))
### Model prediction
# Goal: See how well our model predicts when we give it new data
predictions_LASSO <- predict(model_LASSO, # Use trained model
newdata = test) # To predict outcome with test data
# Model performance/accuracy
model_LASSO_perf <- data.frame(RMSE = RMSE(predictions_LASSO,
test$status),
Rsquared = R2(predictions_LASSO,
test$status))
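This happens simply because RMSE and R-squared are not meaningful for a factor outcome. RMSE() and R2() expect numeric vectors, but predict() on a classification model returns a factor, so cor() complains that 'x' must be numeric. You have to use caret::confusionMatrix
or convert the factors to integers (which I don't think is a good option):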
confusionMatrix(predictions_LASSO,test$status)
model_LASSO_perf <- data.frame(RMSE = RMSE(as.integer(predictions_LASSO),
as.integer(test$status)),
Rsquared = R2(as.integer(predictions_LASSO),
as.integer(test$status)))
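To make concrete what classification metrics look like in place of RMSE and R-squared, here is a minimal base-R sketch. It assumes hypothetical vectors `obs`, `pred`, and `prob_pos` that stand in for `test$status`, `predictions_LASSO`, and the predicted probability of the "pos" class; the values are made up for illustration and are not output from the model above. The Brier score at the end is a suggestion that goes beyond the original answer: it is one way to get an RMSE-like number for a binary outcome, computed on probabilities rather than on class labels.

```r
# Hypothetical observed and predicted classes, standing in for
# test$status and predictions_LASSO (values made up for illustration).
obs  <- factor(c("pos", "neg", "pos", "pos", "neg", "neg"), levels = c("neg", "pos"))
pred <- factor(c("pos", "neg", "neg", "pos", "neg", "pos"), levels = c("neg", "pos"))

# Confusion table: rows are predictions, columns are observations.
tab <- table(Predicted = pred, Observed = obs)

# Classification metrics computed directly from the table,
# treating "pos" as the class of interest.
accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["pos", "pos"] / sum(tab[, "pos"])  # true positive rate
specificity <- tab["neg", "neg"] / sum(tab[, "neg"])  # true negative rate

# Hypothetical predicted probabilities of "pos" (assumption: the kind of
# column predict(..., type = "prob") returns for a caret classifier).
prob_pos <- c(0.8, 0.3, 0.4, 0.7, 0.2, 0.6)

# Brier score: mean squared difference between the predicted
# probability and the 0/1 coded outcome.
brier <- mean((prob_pos - as.numeric(obs == "pos"))^2)

print(tab)
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, brier = brier))
```

Note that confusionMatrix treats the first factor level as the positive class by default, which is "neg" for the toy data here; passing positive = "pos" makes its sensitivity and specificity refer to "pos", as in this sketch.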