Numpy based gradient descent not fully converging

I believe I've implemented GD correctly (based in part on Aurelien Geron's book), but it doesn't return the same result as sklearn's linear regression. Here is the full notebook: https://colab.research.google.com/drive/17lvCb_F_vMskT1PxbrKCSR57B5lMWT7A?usp=sharing

I'm not doing anything fancy. Here is the code that loads the training data:

import numpy as np
import pandas as pd
import sklearn.datasets

#load data
data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values

X_raw = data_arr[:,1:] 
y_raw = data_arr[:, 1:2]

#add bias
X = np.hstack((np.ones(y_raw.shape),X_raw))
y = y_raw

#do gradient descent
learning_rate = 0.001
iterations = 1_000_000

observations = X.shape[0]
features = X.shape[1]

w = np.ones((features,1))

for i in range(iterations):
    w -= (learning_rate) * (2/observations) * X.T.dot(X.dot(w) - y)

Here are the resulting weights:

array([[ 2.72774600e-17],
       [ 1.01847403e+00],
       [ 3.87858604e-02],
       [ 3.06547577e-04],
       [-3.67525543e-01],
       [ 9.09006216e-02],
       [ 4.21512716e-01],
       [ 4.25673672e-01],
       [ 4.77147289e-02],
       [-8.14471370e-03]])

And the MSE: 5.24937033143115e-05

Here's what sklearn gives me:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

%time reg = LinearRegression().fit(X, y)
reg.coef_

sklearn weights:

array([[ 0.00000000e+00,  1.00000000e+00, -9.99200722e-16,
        -1.69309011e-15, -1.11022302e-16,  1.38777878e-15,
        -3.88578059e-16,  6.80011603e-16, -8.32667268e-17,
        -5.55111512e-16]])

sklearn MSE: 1.697650600978984e-32

I've tried increasing/decreasing the number of epochs and the learning rate. Scikit-learn returns its result in a few milliseconds, whereas my GD implementation can run for minutes and still doesn't get close to sklearn's result.

Am I doing something wrong?

(The notebook contains a cleaner version of this code.)

There is a small bug in your code: the first column of X_raw is the same as y_raw, i.e. the target is being used as a feature. This has been corrected in the code below.
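This also explains the near-zero MSE sklearn reports: since the target itself is one of the features, an exact fit exists with a weight of 1 on that column and 0 everywhere else, which is essentially the coefficient vector sklearn returned. A quick sanity check on the slicing from the question:

import numpy as np
import sklearn.datasets

# reproduce the loading step from the question
data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values
X_raw = data_arr[:, 1:]
y_raw = data_arr[:, 1:2]

# the first feature column and the target are the same slice of data_arr
print(np.array_equal(X_raw[:, :1], y_raw))  # True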

Another issue: if you include a column of ones in the feature matrix X, you should make sure to set fit_intercept=False when fitting the linear regression with sklearn, otherwise the intercept is effectively modelled twice.
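A minimal sketch of that pitfall on synthetic data (the toy data below is purely illustrative, not the diabetes set):

import numpy as np
from sklearn.linear_model import LinearRegression

# illustrative toy data: y = 3 + 1.5*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X_feat = rng.normal(size=(100, 2))
y = 3.0 + X_feat @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
X = np.hstack((np.ones((100, 1)), X_feat))  # explicit bias column

# intercept handled once, via the ones column
reg_a = LinearRegression(fit_intercept=False).fit(X, y)
print(reg_a.coef_)                    # approx [3.0, 1.5, -2.0]

# with fit_intercept=True sklearn centers the data, so the ones column becomes
# all zeros, gets a coefficient of ~0, and the intercept ends up in intercept_
reg_b = LinearRegression(fit_intercept=True).fit(X, y)
print(reg_b.coef_, reg_b.intercept_)  # approx [0.0, 1.5, -2.0] and 3.0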

It is also not clear why you divide by the number of observations in the gradient update, since that just scales down the effective learning rate considerably.
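To make that concrete, the two update rules differ only by a constant factor, so dividing by the number of observations is the same as using a step size of learning_rate / n (sketch on random toy data, not the diabetes set):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(442, 10))
y = rng.normal(size=(442, 1))
n, lr = X.shape[0], 0.001

w_q = np.ones((10, 1))   # question-style update: lr with the 2/n factor
w_a = np.ones((10, 1))   # same update with step size lr/n and no 1/n factor
for _ in range(1000):
    w_q -= lr * (2 / n) * X.T @ (X @ w_q - y)
    w_a -= (lr / n) * 2 * X.T @ (X @ w_a - y)

print(np.allclose(w_q, w_a))  # True: identical trajectories

With those points addressed, the corrected code: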

import numpy as np
import pandas as pd
import sklearn.datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load data
data_arr = sklearn.datasets.load_diabetes(as_frame=True).data.values

# extract features and target
X_raw = data_arr[:, 1:]
y_raw = data_arr[:, :1]

# add bias
X = np.hstack((np.ones(y_raw.shape), X_raw))
y = y_raw

# do gradient descent
learning_rate = 0.001
iterations = 1000000

observations = X.shape[0]
features = X.shape[1]

w = np.ones((features, 1))

for i in range(iterations):
    w -= 2 * learning_rate * X.T.dot(X.dot(w) - y)

# exclude the intercept as X already contains a column of ones
reg = LinearRegression(fit_intercept=False).fit(X, y)

# compare the estimated coefficients
res = pd.DataFrame({
    'manual': [format(x, '.6f') for x in w.flatten()],
    'sklearn': [format(x, '.6f') for x in reg.coef_.flatten()]
})

res
#       manual    sklearn
# 0  -0.000000  -0.000000
# 1   0.101424   0.101424
# 2  -0.006468  -0.006468
# 3   0.208211   0.208211
# 4  -0.128653  -0.128653
# 5   0.236556   0.236556
# 6   0.132544   0.132544
# 7  -0.039359  -0.039359
# 8   0.177129   0.177129
# 9   0.145396   0.145396

# compare the RMSE
print(format(mean_squared_error(y, X.dot(w), squared=False), '.6f'))
# 0.043111

print(format(mean_squared_error(y, reg.predict(X), squared=False), '.6f'))
# 0.043111
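As an extra cross-check (continuing with the X, y, w and reg defined above; not part of the original answer), both sets of weights can also be compared against the closed-form least-squares solution:

# closed-form least-squares fit on the same design matrix
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_ls, reg.coef_.reshape(-1, 1)))  # True
print(np.abs(w - w_ls).max())  # should be small once GD has converged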