Should I use feature scaling with polynomial regression with scikit-learn?
I have been using the code below to run lasso regression on a polynomial function. My question is whether I should apply feature scaling as part of the lasso regression (when trying to fit a polynomial function). The R^2 results and the plot produced by the code pasted below suggest not. Any advice on why that is the case, or on whether I have fundamentally messed something up, would be appreciated. Thanks in advance for any suggestions.
```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x/6 + np.random.randn(n) / 10

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
    from sklearn.linear_model import Lasso
    from sklearn.metrics import r2_score
    import matplotlib.pyplot as plt

    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12

    # Expand x into degree-12 polynomial features (fit on train, transform test)
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))

    # Scale the polynomial features to [0, 1], fitting the scaler on train only
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)

    # Lasso regression model - no feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)

    # Lasso regression model - with feature scaling
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)

    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    plt.legend()

    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)

answer_regression()
```
Your X values range over roughly [0, 10], so the polynomial features span much larger ranges. Without scaling, their weights are already small (because the feature values are large), so Lasso does not need to push them to zero. Once you scale them, their weights become much larger and Lasso sets most of them to zero. That is why the prediction in the scaled case is poor (those features are needed to capture the true trend of y).
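To see how extreme the raw feature ranges are, here is a small illustrative sketch (not from the original post) that rebuilds the same degree-12 features and prints the column maxima before and after MinMaxScaler:

```
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5   # same x as in the question

# Column maxima of the raw degree-12 features span from ~1 up to ~1e12 ...
X_poly = PolynomialFeatures(degree=12).fit_transform(x.reshape(-1, 1))
print(X_poly.max(axis=0))

# ... while after MinMaxScaler every column lies in [0, 1]
print(MinMaxScaler().fit_transform(X_poly).max(axis=0))
```

With columns differing by roughly twelve orders of magnitude, the unscaled high-degree terms only need tiny coefficients, which the L1 penalty barely touches; on [0, 1] features the same penalty is much more aggressive.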
You can confirm this by inspecting the weights (linlasso.coef_) in both cases; you will see that most of the weights in the second case (the scaled one) are set to zero.
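As a rough check, a minimal sketch (the names lasso_raw and lasso_scaled are mine, and it assumes the X_train_poly, X_train_scaled and y_train arrays from the code above are still in scope) that refits the two models and counts the surviving coefficients:

```
import numpy as np
from sklearn.linear_model import Lasso

# Same settings as in the question; only the feature representation differs.
lasso_raw = Lasso(alpha=0.01, max_iter=10000).fit(X_train_poly, y_train)
lasso_scaled = Lasso(alpha=0.01, max_iter=10000).fit(X_train_scaled, y_train)

print("non-zero weights, no scaling:  ", np.count_nonzero(lasso_raw.coef_))
print("non-zero weights, with scaling:", np.count_nonzero(lasso_scaled.coef_))
print("scaled-case weights:", lasso_scaled.coef_)
```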
Your alpha value appears to be larger than the optimal one and should be tuned. If you lower alpha, both cases give similar results.
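A minimal sketch of what that tuning could look like, using a plain grid of candidate alphas on the scaled features (the alpha values are only illustrative, and it again assumes the arrays from the question are in scope; with just 11 training points a cross-validated search such as LassoCV will be noisy, so treat this as a sanity check rather than a recipe):

```
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Smaller alphas keep more of the polynomial features alive.
for alpha in (1e-2, 1e-3, 1e-4, 1e-5):
    model = Lasso(alpha=alpha, max_iter=100000).fit(X_train_scaled, y_train)
    r2 = r2_score(y_test, model.predict(X_test_scaled))
    print(f"alpha={alpha:g}  non-zero coefs={np.count_nonzero(model.coef_)}  test R^2={r2:.3f}")
```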