Polynomial regression predicts a straight line

Question

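I want to do a polynomial regression (degree 3), but with any degree below 4 I only get a straight line. I have no experience in data analysis; I basically copied the code from somewhere and fed my data to it. My y data is standardized, and the x values are simply the years 1950-2018.

The problem starts with the x values. If I use the years, it does not work at all: it simply predicts a straight line no matter which degree I choose. If I use the index to fit the model instead, it at least works for degree 4 and higher.

But I only want to use degree 2 or 3.

My code: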

x = np.array(list(range(1, 70)))
y = np.array([-1.07312323, -1.12360264, -1.16848888, -1.21237286, -1.24931163,
   -1.24563078, -1.25029589, -1.25804974, -1.26992981, -1.2759396 ,
   -1.31707672, -1.28845207, -1.2553561 , -1.21670196, -1.17405228,
   -1.13823657, -1.10201293, -1.0652651 , -1.01830663, -0.95872599,
   -0.86864519, -0.77287454, -0.67380868, -0.56936508, -0.47234488,
   -0.38025164, -0.28073984, -0.17953134, -0.08026437,  0.01376177,
    0.09177617,  0.15270399,  0.2005737 ,  0.23841612,  0.2860362 ,
    0.34606907,  0.39385415,  0.44154466,  0.49050035,  0.5338063 ,
    0.58003198,  0.61416929,  0.59416923,  0.56887929,  0.53366038,
    0.4907952 ,  0.45338928,  0.40975728,  0.35098762,  0.29307093,
    0.24168722,  0.21576624,  0.25267974,  0.3066606 ,  0.37672389,
    0.45321951,  0.53410345,  0.62491894,  0.72720349,  0.81841313,
    0.9213128 ,  1.03645707,  1.15479503,  1.25998302,  1.35221566,
    1.44653627,  1.52833712,  1.60458778,  1.68225894])

# transforming the data to include another axis
x = x[:, np.newaxis]
y = y[:, np.newaxis]

polynomial_features= PolynomialFeatures(degree=4)
x_poly = polynomial_features.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)

plt.scatter(x, y, s=10)
plt.plot(x, y_poly_pred, color='m')
plt.show()

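Resulting plot for the degree 4 code:

With degree 3 or below, this is what it looks like:

Is my data in the wrong format, or is the code not working? I tried several other code snippets and ran into similar problems. Or do I simply not understand polynomial regression and am missing something important?

By the way, I don't care whether polynomial regression is the right model for the data; I have to do it this way.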

Answer 1

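From the sklearn documentation for PolynomialFeatures:

"Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree." So you are not guaranteed to get the degree you specified; it is only the maximum degree. I have not used this function myself, but I would guess it picks the best fit. In your case, it is easy to imagine that polynomials of degree 2 and 3 might actually approximate your data worse than a degree 1 polynomial (a straight line), because they have a fixed number of extrema and inflection points. I therefore suspect the code sticks with a straight line until you allow degree 4, where three extrema (two inflection points) are possible, which fits your data well.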
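To make the quoted sentence concrete, here is a minimal sketch of what the transformer returns for a single input feature. The two sample values are made up for illustration. Note that the transformer only builds the power columns; the weight given to each column is decided afterwards by the regression that is fitted on them.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up sample values, shaped as a column (n_samples, n_features)
x_demo = np.array([[2.0], [3.0]])

# degree=3 produces the columns x^0, x^1, x^2, x^3 for each sample
features = PolynomialFeatures(degree=3).fit_transform(x_demo)
print(features)
```

Each row contains every power of the input up to the requested degree, so a degree 3 request always yields four columns for one input feature.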

A function that may suit you better is numpy.polyfit.

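import numpy as np
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame holding the y values in column '0'
x = df.index.values
y = df['0'].to_numpy()

degree = 4
# Fit coefficients
coeffs = np.polyfit(x, y, degree)
# Generate polynomial function f(x)
f = np.poly1d(coeffs)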


plt.scatter(x, y, s=10)
plt.plot(x, f(x), color='m')
plt.show()

Answer 2

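A few points first. For full reproducibility, please also state your packages and where each function comes from. Second, as @Vinzent already mentioned in the comments, a higher-order polynomial will always fit your data at least as well; that is the basis of Taylor series. Third, let's explore what is happening in your model. You say it is a straight line and that it does not work. Well, the coefficients say otherwise: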

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt


x = np.arange(1,70)
y = np.array([-1.07312323, -1.12360264, -1.16848888, -1.21237286, -1.24931163,
   -1.24563078, -1.25029589, -1.25804974, -1.26992981, -1.2759396 ,
   -1.31707672, -1.28845207, -1.2553561 , -1.21670196, -1.17405228,
   -1.13823657, -1.10201293, -1.0652651 , -1.01830663, -0.95872599,
   -0.86864519, -0.77287454, -0.67380868, -0.56936508, -0.47234488,
   -0.38025164, -0.28073984, -0.17953134, -0.08026437,  0.01376177,
    0.09177617,  0.15270399,  0.2005737 ,  0.23841612,  0.2860362 ,
    0.34606907,  0.39385415,  0.44154466,  0.49050035,  0.5338063 ,
    0.58003198,  0.61416929,  0.59416923,  0.56887929,  0.53366038,
    0.4907952 ,  0.45338928,  0.40975728,  0.35098762,  0.29307093,
    0.24168722,  0.21576624,  0.25267974,  0.3066606 ,  0.37672389,
    0.45321951,  0.53410345,  0.62491894,  0.72720349,  0.81841313,
    0.9213128 ,  1.03645707,  1.15479503,  1.25998302,  1.35221566,
    1.44653627,  1.52833712,  1.60458778,  1.68225894])

# transforming the data to include another axis
x_new = x[:, np.newaxis].copy()
y_new = y[:, np.newaxis].copy()

for i in range(1,10):
    print(f"Degree {i}")
    polynomial_features= PolynomialFeatures(degree=i)
    x_poly = polynomial_features.fit_transform(x_new)

    model = LinearRegression()
    model.fit(x_poly, y_new)
    y_poly_pred = model.predict(x_poly)
    print("Scklearn: ", model.coef_[0], model.intercept_)
    
    coeffs = np.polyfit(x, y, i)
    # Generate polynomial function f(x)
    f = np.poly1d(coeffs)
    print("Numpy: \n", f)

    plt.scatter(x, y, s=10)
    plt.plot(x_new, y_poly_pred, color='m',label='sklearn')
    plt.plot(x, f(x), color='r', label='np')
    plt.legend()
    plt.show()

This gives you:

Degree 1
Scklearn:  [0.         0.04205913] [-1.520228]
Numpy: 
  
0.04206 x - 1.52
Degree 2
Scklearn:  [ 0.00000000e+00  4.58554678e-02 -5.42334612e-05] [-1.56515139]
Numpy: 
             2
-5.423e-05 x + 0.04586 x - 1.565
Degree 3
Scklearn:  [ 0.00000000e+00  4.21624962e-02  7.67141331e-05 -1.24711995e-06] [-1.54283792]
Numpy: 
             3             2
-1.247e-06 x + 7.671e-05 x + 0.04216 x - 1.543
Degree 4
Scklearn:  [ 0.00000000e+00 -1.79439905e-01  1.40847170e-02 -3.11025813e-04
  2.21270495e-06] [-0.71710953]
Numpy: 
            4            3           2
2.213e-06 x - 0.000311 x + 0.01408 x - 0.1794 x - 0.7171
Degree 5
Scklearn:  [ 0.00000000e+00 -1.55916194e-01  1.17996085e-02 -2.24932455e-04
  8.34195933e-07  7.87719440e-09] [-0.77753422]
Numpy: 
            5             4             3          2
7.877e-09 x + 8.342e-07 x - 0.0002249 x + 0.0118 x - 0.1559 x - 0.7775

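I go beyond the degree 4 you mentioned, just to illustrate the situation. Note a few things here. If the fit were a straight line, you would expect all coefficients above first order to be zero, which is not the case. However, if you look at, e.g., degree 3, x^2 gets 7.67141331e-05 and x^3 gets -1.24711995e-06. These are very close to zero, so you would not expect them to play a significant role in your result.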
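As a rough check on how much the higher-order terms matter, the sketch below evaluates each term of the degree 3 fit at the largest x value in the data (x = 69), using the sklearn coefficients printed above. The variable names are only illustrative.

```python
# Degree 3 coefficients as printed above (sklearn output)
c1, c2, c3 = 4.21624962e-02, 7.67141331e-05, -1.24711995e-06

x_max = 69  # largest x value in the data

# Contribution of each term to the prediction at x_max
linear = c1 * x_max
quadratic = c2 * x_max**2
cubic = c3 * x_max**3

print(f"linear term:       {linear:.3f}")
print(f"quadratic term:    {quadratic:.3f}")
print(f"cubic term:        {cubic:.3f}")
print(f"quadratic + cubic: {quadratic + cubic:.3f}")
```

The quadratic and cubic terms come out at roughly 0.37 and -0.41, so they largely cancel each other, while the linear term contributes about 2.9. That is consistent with a curve that looks almost straight over the data range.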

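Another piece of evidence that it is not a linear fit comes from using sklearn's R^2 to check how close you are. If you always had a straight line, your error should not change. But the score does increase with the degree, as it should, if you check it (just add print(model.score(x_poly, y_new)) to the code):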

Degree 1
0.9078883471490475
Degree 2
0.9083670747869753
Degree 3
0.9084444071378761
Degree 4
0.9817855101145114
Degree 5
0.9820634174954168

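Do you notice anything interesting in the values? Maybe the sudden jump in the model fit? Or that the first three values are nearly the same (albeit with some improvement)? Well, isn't it a coincidence that you were unhappy with the straight line up to degree 3, and happy with your model from degree 4 on? This is why you always want to look at what your model is doing and how well it performs, both visually and numerically.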
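For reference, the value returned by model.score for a LinearRegression is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The following is a minimal sketch on synthetic data (a noisy sine curve chosen only for illustration, not the data from the question) that computes it by hand next to model.score for increasing degrees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
x = np.arange(1, 70, dtype=float)[:, np.newaxis]
y = np.sin(x[:, 0] / 10.0) + 0.05 * rng.standard_normal(x.shape[0])

scores = []         # R^2 reported by sklearn
manual_scores = []  # R^2 computed by hand
for degree in range(1, 5):
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)

    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    residual = y - model.predict(x_poly)
    ss_res = np.sum(residual**2)
    ss_tot = np.sum((y - y.mean())**2)

    manual_scores.append(1.0 - ss_res / ss_tot)
    scores.append(model.score(x_poly, y))
    print(degree, scores[-1], manual_scores[-1])
```

Because every lower-degree polynomial is a special case of the higher-degree one, the training score should never decrease as the degree grows; how large each step is depends entirely on the data.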

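In other words, nothing is wrong with your code snippet. The higher-order coefficients just happen to be very close to zero, so the fit looks like a straight line.

Note: I included both sklearn and numpy for the data in your original post, because someone else had already mentioned numpy and it prints the coefficients more nicely. In this case, the two give nearly identical results.

However, they are not quite the same in your original case, and your problem goes deeper. The reason sklearn does not work with your years is more involved, and I suggest reading this article. In short, you need to scale your variables, see below. Also, look into the difference between polynomial regression and interpolation.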

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing

x = np.arange(1950,2019)

y = np.array([-1.07312323, -1.12360264, -1.16848888, -1.21237286, -1.24931163,
   -1.24563078, -1.25029589, -1.25804974, -1.26992981, -1.2759396 ,
   -1.31707672, -1.28845207, -1.2553561 , -1.21670196, -1.17405228,
   -1.13823657, -1.10201293, -1.0652651 , -1.01830663, -0.95872599,
   -0.86864519, -0.77287454, -0.67380868, -0.56936508, -0.47234488,
   -0.38025164, -0.28073984, -0.17953134, -0.08026437,  0.01376177,
    0.09177617,  0.15270399,  0.2005737 ,  0.23841612,  0.2860362 ,
    0.34606907,  0.39385415,  0.44154466,  0.49050035,  0.5338063 ,
    0.58003198,  0.61416929,  0.59416923,  0.56887929,  0.53366038,
    0.4907952 ,  0.45338928,  0.40975728,  0.35098762,  0.29307093,
    0.24168722,  0.21576624,  0.25267974,  0.3066606 ,  0.37672389,
    0.45321951,  0.53410345,  0.62491894,  0.72720349,  0.81841313,
    0.9213128 ,  1.03645707,  1.15479503,  1.25998302,  1.35221566,
    1.44653627,  1.52833712,  1.60458778,  1.68225894])

# transforming the data to include another axis
x_new = x[:, np.newaxis].copy()
y_new = y[:, np.newaxis].copy()

# scaling
scaler = preprocessing.StandardScaler()
polyreg_scaled=make_pipeline(PolynomialFeatures(4),scaler,LinearRegression())
polyreg_scaled.fit(x_new,y_new)

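# no scaling (build the degree 4 features from the year values first)
x_poly = PolynomialFeatures(4).fit_transform(x_new)
model = LinearRegression()
model.fit(x_poly, y_new)
y_poly_pred = model.predict(x_poly)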


plt.scatter(x, y, s=10)
plt.plot(x_new, polyreg_scaled.predict(x_new), color='g',label='sklearn scaled')
plt.plot(x_new, y_poly_pred, color='m',label='sklearn')
plt.legend()
plt.show()

This produces:

As you can see, the scaled version works well, but the unscaled version is bad.
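One way to get a feeling for why the raw year values cause trouble is to look at the condition number of the polynomial feature matrix. The sketch below is an illustration under my own assumptions, not something taken from the linked article. It also standardizes the years before expanding them into powers, which is a variation on the pipeline above (that one scales after the expansion).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

years = np.arange(1950, 2019, dtype=float)[:, np.newaxis]

# Degree 4 features built directly from the years: columns up to ~1e13
raw_features = PolynomialFeatures(degree=4).fit_transform(years)

# Variation: standardize the years first, then build the powers
years_scaled = StandardScaler().fit_transform(years)
scaled_features = PolynomialFeatures(degree=4).fit_transform(years_scaled)

# A huge condition number means the columns are numerically almost
# indistinguishable, so the least-squares solver cannot separate them
cond_raw = np.linalg.cond(raw_features)
cond_scaled = np.linalg.cond(scaled_features)

print(f"condition number, raw years:    {cond_raw:.3e}")
print(f"condition number, scaled years: {cond_scaled:.3e}")
```

Over the range 1950 to 2018, the columns x, x^2, x^3 and x^4 are almost perfectly proportional to each other, which is why the raw matrix is so badly conditioned. Centering and scaling x removes most of that overlap.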

python

data-analysis