How to use GridSearchCV for polynomials of different degrees?

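What I want to do is check some OLS fits with polynomials of different degrees, to see which degree does better at predicting mpg given horsepower (using both LOOCV and KFold). I wrote the code, but I couldn't figure out how to use GridSearchCV to apply PolynomialFeatures on each iteration, so I ended up writing this: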

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error



df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/auto/Auto.csv')[['horsepower','mpg']].dropna()

pows = range(1,11)
first, second, mse = [], [], 0     # 'first' is data for the first plot and 'second' is for the second one

for p in pows:
    mse = 0
    for train_index, test_index in LeaveOneOut().split(df):
        x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
        y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
        polynomial_features = PolynomialFeatures(degree = p)
        x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   # polynomial features of the training data
        ft = LinearRegression().fit(x,y_train)
        x1 = polynomial_features.transform(x_test.values.reshape(-1,1))   # reuse the fitted transformer on the held-out point
        mse += mean_squared_error(y_test, ft.predict(x1))
    first.append(mse/len(df))
    
for p in pows: 
    temp = []   
    for i in range(9):      # this is to plot a few graphs for comparison
        mse = 0
        for train_index, test_index in KFold(n_splits=10, shuffle=True).split(df):   # keyword arguments are required in recent scikit-learn
            x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
            y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
            polynomial_features = PolynomialFeatures(degree = p)
            x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   #getting the polynomial
            ft = LinearRegression().fit(x,y_train)
            x1 = polynomial_features.transform(x_test.values.reshape(-1,1))   # reuse the fitted transformer on the test fold
            mse += mean_squared_error(y_test, ft.predict(x1))
        temp.append(mse/10)
    second.append(temp)      


f, pt = plt.subplots(1,2,figsize=(12,5.1))
f.tight_layout(pad=5.0)
pt[0].set_ylim([14,30])
pt[1].set_ylim([14,30])
pt[0].plot(pows, first, color='darkblue', linewidth=1)
pt[0].scatter(pows, first, color='darkblue')
pt[1].plot(pows, second)
pt[0].set_title("LOOCV", fontsize=15)
pt[1].set_title("10-fold CV", fontsize=15)
pt[0].set_xlabel('Degree of Polynomial', fontsize=15)
pt[1].set_xlabel('Degree of Polynomial', fontsize=15)
pt[0].set_ylabel('Mean Squared Error', fontsize=15)
pt[1].set_ylabel('Mean Squared Error', fontsize=15)
plt.show()

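It produces a figure with two panels: mean squared error against polynomial degree for LOOCV (left) and for nine runs of 10-fold CV (right).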

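It works perfectly, and you can run it on your machine to test it. It does exactly what I want, but it really seems like overkill. I'm asking for advice on how to improve it using GridSearchCV or anything else, really. I tried passing PolynomialFeatures in a pipeline together with LinearRegression(), but couldn't change x on the fly. A working example would be much appreciated.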

Something like this seems to be the way to do it:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={'poly__degree': list(pows)},
    scoring='neg_mean_squared_error',
    cv=LeaveOneOut(),
)

search.fit(df[['horsepower']], df.mpg)

first = -search.cv_results_['mean_test_score']

(The last line negates the scores because the scorer returns the negative MSE.)

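The plotting can then be done in more or less the same way. (We are relying here on cv_results_ having its entries in the same order as pows; you may prefer to plot from the appropriate columns of pd.DataFrame(search.cv_results_) instead.)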
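
As a rough sketch of the DataFrame route, the snippet below reads the degree and the score from named columns instead of relying on ordering. The data here is a small synthetic stand-in for the Auto dataset (an assumption, so that the snippet runs without a download), and the degree range is shortened to 1..5 to keep it quick; the column name param_poly__degree follows from the 'poly__degree' key in the parameter grid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (assumption, not the real dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=60)
mpg = 40 - 0.25 * horsepower + 0.0006 * horsepower**2 + rng.normal(0, 2, size=60)
df = pd.DataFrame({'horsepower': horsepower, 'mpg': mpg})

pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
])
search = GridSearchCV(
    estimator=pipe,
    param_grid={'poly__degree': list(range(1, 6))},  # shortened range for a quick run
    scoring='neg_mean_squared_error',
    cv=LeaveOneOut(),
)
search.fit(df[['horsepower']], df.mpg)

# One row per parameter setting; pick the columns by name.
results = pd.DataFrame(search.cv_results_)
results['degree'] = results['param_poly__degree'].astype(int)
results['mse'] = -results['mean_test_score']   # undo the sign flip of the scorer
results = results.sort_values('degree')
print(results[['degree', 'mse']])
```

With this, the left panel could be drawn with pt[0].plot(results['degree'], results['mse']) and the result no longer depends on the order of the entries in cv_results_.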

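You can use RepeatedKFold to emulate your loop over KFold, although then you will get only a single curve; if you really want separate curves, you still need the outer loop, but a grid search with cv=KFold(...) can replace the inner loop.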
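
Both options might look roughly like the sketch below. It again uses a synthetic stand-in for the Auto data and a shortened degree range (both assumptions for a quick, download-free run); cv_mse is a hypothetical helper name, and the random_state values are arbitrary choices that only make the shuffles reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (assumption, not the real dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=60)
mpg = 40 - 0.25 * horsepower + 0.0006 * horsepower**2 + rng.normal(0, 2, size=60)
df = pd.DataFrame({'horsepower': horsepower, 'mpg': mpg})

degrees = list(range(1, 6))   # shortened range for a quick run
pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
])

def cv_mse(cv):
    """Hypothetical helper: cross-validated MSE per degree for the splitter cv."""
    search = GridSearchCV(
        estimator=pipe,
        param_grid={'poly__degree': degrees},
        scoring='neg_mean_squared_error',
        cv=cv,
    )
    search.fit(df[['horsepower']], df.mpg)
    return -search.cv_results_['mean_test_score']

# Separate curves: keep the outer loop, one shuffled 10-fold split per seed.
curves = np.array([
    cv_mse(KFold(n_splits=10, shuffle=True, random_state=seed))
    for seed in range(9)
])
second = curves.T   # shape (n_degrees, 9), the layout pt[1].plot(pows, second) expects

# Single curve: RepeatedKFold averages over all 9 x 10 folds in one search.
averaged = cv_mse(RepeatedKFold(n_splits=10, n_repeats=9, random_state=0))
```

Here second has one column per repetition, so it can be passed to plot in the same way as in the question, while averaged is one value per degree.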