Why does calculating MSE in lasso regression give different outputs?
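I'm trying to run different regression models on the prostate cancer data from the lasso2 package. When I use lasso, I have seen two different ways to calculate the mean squared error, but they give me quite different results. Am I doing something wrong, or does this mean one method is better than the other?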
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min
# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest)
# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
These are the outputs I get for the two MSEs:
MSE (method 1): 0.4609978
MSE (method 2): 0.5654089
They are quite different. Does anyone know why?
Thanks a lot for your help!
Samuel
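As @alistaire pointed out, in the first case you compute the MSE on the test data, while in the second case the reported MSE comes from the cross-validation folds of the training data, so it is not an apples-to-apples comparison.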
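To make the distinction concrete, here is a minimal base-R sketch on simulated data. The data, the use of `lm` instead of lasso, the 5-fold setup, and all variable names are assumptions for illustration only; it is not the Prostate analysis. It computes one MSE on a held-out test set and another by cross-validation inside the training set, and the two numbers generally differ because they are estimated on different observations.

```r
# Simulated data (an assumption for illustration; not the Prostate data).
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# 75/25 train/test split, mirroring the split in the question.
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_ind, ]
test <- dat[-train_ind, ]

mse <- function(pred, obs) mean((pred - obs)^2)

# Estimate 1: fit on all training rows, score on the untouched test rows.
fit <- lm(y ~ x, data = train)
test_mse <- mse(predict(fit, newdata = test), test$y)

# Estimate 2: 5-fold cross-validation using the training rows only.
k <- 5
foldid <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_pred <- numeric(nrow(train))
for (f in seq_len(k)) {
  held_out <- foldid == f
  fold_fit <- lm(y ~ x, data = train[!held_out, ])
  # Each training row is predicted by a model that never saw it.
  cv_pred[held_out] <- predict(fold_fit, newdata = train[held_out, ])
}
cv_mse <- mse(cv_pred, train$y)

cat("Test MSE:", test_mse, "\n")
cat("CV MSE:  ", cv_mse, "\n")
```

Neither number is the "right" one: the cross-validated value is what you would normally use to choose a tuning parameter, and the test value is what you would report as the estimate of out-of-sample error.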
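We can do something like the following to get a like-for-like comparison (by keeping the cross-validated fitted values from the training folds). We can then see that mse.1 and mse.2 are exactly equal when computed on the same training folds (although the value is slightly different from yours; my desktop runs R version 3.1.2, x86_64-w64-mingw32, Windows 10):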
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
# keep = TRUE stores the cross-validated (held-out fold) fitted values for the training data
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)
# get the predicted values on the training folds with lambda.min (not from test data)
# fit.preval is the full name of the element that $fit partially matches
mse.1 = mse(pr.lasso$fit.preval[,lambda.id], ytrain)
cat("MSE (method 1): ", mse.1, "\n")
MSE (method 1): 0.6044496
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
MSE (method 2): 0.6044496
mse.1 == mse.2
[1] TRUE
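The equality above is not a coincidence. As a hedged sketch in base R (simulated data, `lm` instead of lasso, and hypothetical variable names), the code below shows that the pooled MSE of the held-out predictions equals the mean of the per-fold MSEs weighted by fold size. This assumes, as I understand it, that `cvm` is an observation-weighted average across folds; an unweighted mean of fold MSEs would generally differ when folds have unequal sizes.

```r
# Simulated data (an assumption for illustration; not the Prostate data).
set.seed(42)
n <- 73                      # deliberately not divisible by k: unequal folds
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

k <- 10
foldid <- sample(rep(seq_len(k), length.out = n))

# Held-out prediction for every observation.
oof_pred <- numeric(n)
for (f in seq_len(k)) {
  held_out <- foldid == f
  fold_fit <- lm(y ~ x, data = dat[!held_out, ])
  oof_pred[held_out] <- predict(fold_fit, newdata = dat[held_out, ])
}

sq_err <- (oof_pred - y)^2

# Pooled MSE over all held-out predictions (analogous to mse.1 above).
pooled_mse <- mean(sq_err)

# Per-fold MSEs, then their fold-size-weighted mean (analogous to mse.2).
fold_mse <- as.vector(tapply(sq_err, foldid, mean))
fold_n <- tabulate(foldid, nbins = k)
weighted_mse <- weighted.mean(fold_mse, w = fold_n)

# Compare with a tolerance rather than ==, since these are floating-point sums.
print(isTRUE(all.equal(pooled_mse, weighted_mse)))
```

For floating-point results like these, `all.equal` is usually a safer check than `==`, which can return FALSE because of rounding in the last digits.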