Multivariable linear regression doesn't get more accurate with higher polynomial degree?

Question

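I'm computing the mean squared error on the training set, so I expect the MSE to go down as I use higher-degree polynomials. However, going from degree 4 to degree 5, the MSE increases significantly. What could be the cause?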

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv"
df = pd.read_csv(path)

r = []  # training MSE for each degree
max_degrees = 10
y = df['price'].astype('float')
x = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']].astype('float')

for i in range(1, max_degrees + 1):
    # Scale the inputs, expand to polynomial features of degree i, then fit OLS
    Input = [('scale', StandardScaler()),
             ('polynomial', PolynomialFeatures(degree=i)),
             ('model', LinearRegression())]
    pipe = Pipeline(Input)
    pipe.fit(x, y)
    yhat = pipe.predict(x)  # predictions on the training data itself
    mse = mean_squared_error(y, yhat)
    r.append(mse)
    print("MSE for MLR of degree " + str(i) + " = " + str(round(mse / 1e6, 1)))

plt.figure(figsize=(10, 3))
plt.plot(list(range(1, max_degrees + 1)), r)
plt.show()
```

Result: [image not preserved]

Answer 1

Originally, you have 200 observations in y and 4 features (columns) in X, which you then scale and transform into polynomial features.

Degree 4 thus has 120 < 200 polynomial features, while degree 5 is the first to have 210 > 200 polynomial features, i.e. more features than observations.
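The column counts can be checked rather than taken on trust. With its default settings (`include_bias=True`, `interaction_only=False`), `PolynomialFeatures` generates one column per monomial of total degree at most d, which for n inputs is C(n + d, d) columns. Below is a minimal sketch of that count; the helper name `n_poly_features` is hypothetical, and the 200 observations are simply the figure quoted above, not re-checked against the CSV. By this formula, four inputs give 70, 126 and 210 columns at degrees 4, 5 and 6, so it is worth confirming the exact degree at which the column count passes the number of rows, for example by inspecting the shape of the transformed matrix.

```python
from math import comb


def n_poly_features(n_inputs: int, degree: int, include_bias: bool = True) -> int:
    """Count monomials of total degree <= `degree` in `n_inputs` variables."""
    count = comb(n_inputs + degree, degree)
    # Without the bias column, the constant term is dropped
    return count if include_bias else count - 1


n_inputs = 4          # horsepower, curb-weight, engine-size, highway-mpg
n_observations = 200  # assumption: figure taken from the answer text

counts = {d: n_poly_features(n_inputs, d) for d in range(1, 11)}
for d, c in counts.items():
    flag = "  <- more columns than rows" if c > n_observations else ""
    print(f"degree {d:2d}: {c:5d} columns{flag}")
```

The same numbers should match `pipe[:-1].fit_transform(x).shape[1]` for the pipeline in the question, assuming a scikit-learn version that supports pipeline slicing.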

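When there are more features than observations, linear regression is ill-defined (the least-squares problem has no unique solution) and should not be used, as explained here. This might explain the sudden deterioration of the fit to the training set when going from degree 4 to degree 5. For higher degrees, the LR solver still seems able to overfit the training data.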
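To illustrate what "ill-defined" means in practice, here is a small sketch on synthetic data (not the automobile dataset). The matrix sizes and random data are assumptions chosen for illustration, and NumPy's `lstsq` stands in for whatever solver the library uses internally. With more columns than rows, the design matrix has a non-trivial null space, so many different coefficient vectors give the same (here essentially perfect) fit to the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 20, 30  # more columns (features) than rows (observations)
X = rng.normal(size=(n_rows, n_cols))
y = rng.normal(size=n_rows)

# Minimum-norm least-squares solution; rank can be at most n_rows
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Rows of vt beyond the rank span the null space of X
_, _, vt = np.linalg.svd(X)  # full_matrices=True, so vt is (n_cols, n_cols)
null_vec = vt[-1]

# Adding a null-space vector changes the coefficients but not the predictions
other = coef + 5.0 * null_vec

mse_a = np.mean((X @ coef - y) ** 2)
mse_b = np.mean((X @ other - y) ** 2)

print("rank of X:", rank, "of", n_cols, "columns")
print("training MSE, minimum-norm solution:", mse_a)
print("training MSE, shifted solution:     ", mse_b)
```

Both coefficient vectors interpolate the training targets, which is consistent with the observation that the solver can still overfit at higher degrees. Which of the many solutions is returned, and how stable it is numerically, depends on the solver.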

python

regression

mse

scikit-learn