ElasticNetCV in Python vs cv.glmnet in R
Has anyone tried to get the same results from ElasticNetCV in Python and cv.glmnet in R?
I have worked out how to do it with ElasticNet in Python and glmnet in R, but I cannot reproduce it with the cross-validated versions...
Steps to reproduce in Python:
Preprocessing:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100000,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
pd.DataFrame(X_train).to_csv('X_train.csv', index=None)
pd.DataFrame(X_test).to_csv('X_test.csv', index=None)
pd.DataFrame(y_train).to_csv('y_train.csv', index=None)
pd.DataFrame(y_test).to_csv('y_test.csv', index=None)
Model:
model = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    copy_X=True,
    tol=0.0000001,
    warm_start=False,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(
    X=X_train,
    y=y_train
)
y_pred = model.predict(
    X=X_test
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=y_pred
    )
)
Output: 42399.49815189786
model = ElasticNetCV(
    l1_ratio=0.5,
    eps=0.001,
    n_alphas=100,
    alphas=None,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    tol=0.0000001,
    cv=10,
    copy_X=True,
    verbose=0,
    n_jobs=-1,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(
    X=X_train,
    y=y_train
)
y_pred = model.predict(
    X=X_test
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=y_pred
    )
)
Output: 39354.729173913176
Steps to reproduce in R:
Preprocessing:
library(glmnet)
X_train <- read.csv(path)
X_test <- read.csv(path)
y_train <- read.csv(path)
y_test <- read.csv(path)
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error = y_test - y_pred
mean(as.matrix(y_error)^2)
Output: 42399.5
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
Output: 37.00207
Thanks a lot for providing the example. I'm on a laptop, so I had to reduce the number of samples to 100:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
When you predict with glmnet you need to specify lambda; otherwise it returns predictions for every lambda on the path. So in R:
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
dim(y_pred)
[1] 25 89
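For comparison, a rough Python analogue of that "one column per lambda" matrix can be sketched with sklearn's enet_path. This is only an illustration: the five-value penalty grid and the random_state passed to train_test_split are assumptions, not part of the original example, and the data are centered by hand because enet_path does not fit an intercept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # seed added here for repeatability
)

# enet_path fits no intercept, so center X and y with the training means first.
x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
grid = np.array([2.0, 1.0, 0.5, 0.1, 0.01])  # illustrative penalty grid

# coefs has one column of coefficients per penalty value.
alphas, coefs, _ = enet_path(
    X_train - x_mean, y_train - y_mean, l1_ratio=1.0, alphas=grid
)

# One column of predictions per penalty, like predict() on a glmnet fit without s.
preds = (X_test - x_mean) @ coefs + y_mean
print(preds.shape)  # (25, 5)
```

Computing a test MSE against this whole matrix averages over every penalty at once, which is why a single column has to be picked first.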
When you run cv.glmnet, predict uses a lambda selected by cross-validation, lambda.1se by default, so you get a single set of predictions, and this is the MSE you want:
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 22.03504
dim(y_error)
[1] 25 1
fit$lambda.1se
[1] 1.278699
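ElasticNetCV, by contrast, keeps the penalty with the lowest mean CV error (closer to glmnet's lambda.min), so part of any gap may come from the selection rule alone. Below is a minimal sketch of a one-standard-error rule built on ElasticNetCV's mse_path_. The standard-error formula is a plain per-fold approximation and is not guaranteed to match glmnet's internal computation; the grid and the split seed are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # seed added here for repeatability
)

grid = [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2]  # illustrative penalty grid
cv_model = ElasticNetCV(l1_ratio=1.0, alphas=grid, cv=10, max_iter=100000)
cv_model.fit(X_train, y_train)

# mse_path_ has shape (n_alphas, n_folds) when l1_ratio is a single float.
mse = cv_model.mse_path_
mean_mse = mse.mean(axis=1)
se_mse = mse.std(axis=1, ddof=1) / np.sqrt(mse.shape[1])

best = np.argmin(mean_mse)
alpha_min = cv_model.alphas_[best]  # what ElasticNetCV itself selects

# 1-SE rule: largest penalty whose mean CV error is within one SE of the minimum.
threshold = mean_mse[best] + se_mse[best]
alpha_1se = cv_model.alphas_[mean_mse <= threshold].max()

print(alpha_min, alpha_1se)
```

Refitting ElasticNet with alpha=alpha_1se would then be the closer counterpart of predicting from a cv.glmnet fit with its default s.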
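If we select the lambda in glmnet that is closest to the one chosen by cv.glmnet, you get something in the right range:
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
# abs() is needed to find the nearest lambda; without it which.min returns the smallest lambda
sel = which.min(abs(fit$lambda - 1.278699))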
y_pred <- predict(fit, newx = as.matrix(X_test))[,sel]
mean((y_test - y_pred)^2)
dim(y_error)
mean(as.matrix((y_test - y_pred)^2))
[1] 20.0775
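One detail worth checking when picking the nearest penalty: the minimum has to be taken over the absolute difference, since the minimum of the signed difference is simply the smallest value on the path. A small sketch with a made-up descending lambda sequence (the numbers in lambdas are illustrative, not taken from the fit above):

```python
import numpy as np

# Made-up descending penalty path, in the order glmnet stores fit$lambda.
lambdas = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
target = 1.278699  # the lambda.1se reported above

signed = int(np.argmin(lambdas - target))           # index of the smallest lambda
nearest = int(np.argmin(np.abs(lambdas - target)))  # index of the closest lambda

print(signed, lambdas[signed])    # 4 0.25
print(nearest, lambdas[nearest])  # 2 1.0
```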
Before comparing with sklearn, we need to make sure both are tested over the same range of lambda values.
L = c(0.01,0.05,0.1,0.2,0.5,1,2)
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train),lambda=L)
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 0.003065869
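So we expect something around 0.003065869. We run sklearn with the same lambda values; lambda is called alpha in ElasticNet, and alpha in glmnet is actually your l1_ratio (see the glmnet vignette). The normalize option should be set to False, because: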
If True, the regressors X will be normalized before regression by
subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
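Following that advice, here is a minimal sketch of standardizing with StandardScaler in a pipeline. It assumes the goal is to approximate glmnet's default standardize = TRUE; the helper glmnet_to_sklearn is hypothetical and only renames the two parameters as described above, and the split seed is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def glmnet_to_sklearn(lambdas, alpha):
    """Hypothetical helper: glmnet's lambda -> sklearn's alphas, glmnet's alpha -> l1_ratio."""
    return {"alphas": list(lambdas), "l1_ratio": alpha}


X, y = make_regression(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # seed added here for repeatability
)

params = glmnet_to_sklearn([0.01, 0.05, 0.1, 0.2, 0.5, 1, 2], alpha=1.0)

# Scale the features explicitly instead of relying on the normalize option.
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(cv=10, max_iter=100000, **params),
)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
```

Note that the scaler here is fitted once on the full training set rather than inside each CV fold, so this is an approximation of what glmnet does, not an exact replica.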
So we just run it with CV:
model = ElasticNetCV(l1_ratio=1,fit_intercept=True,alphas=[0.01,0.05,0.1,0.2,0.5,1,2])
model.fit(X=X_train,y=y_train)
y_pred = model.predict(X=X_test)
mean_squared_error(y_true=y_test,y_pred=y_pred)
0.0018007824874741929
This is roughly the same as the R result.
If you do the same for ElasticNet and specify alpha, you will get the same result.
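As a quick check of that last point, the sketch below fits ElasticNetCV on a fixed grid and then a plain ElasticNet with the alpha that CV selected; the two should give the same predictions, since ElasticNetCV refits on the full training set with its chosen alpha. The grid, cv=5, and the split seed are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # seed added here for repeatability
)

grid = [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2]  # illustrative penalty grid
cv_model = ElasticNetCV(l1_ratio=1.0, alphas=grid, cv=5, max_iter=100000)
cv_model.fit(X_train, y_train)

# Plain ElasticNet with the penalty that cross-validation picked.
single = ElasticNet(alpha=cv_model.alpha_, l1_ratio=1.0, max_iter=100000)
single.fit(X_train, y_train)

same = np.allclose(cv_model.predict(X_test), single.predict(X_test))
print(cv_model.alpha_, same)
```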