What is the issue with this implementation of gradient descent?
I tried to implement linear regression using gradient descent, but my error blows up to infinity. I have read through my code and still cannot figure out where I went wrong, so I am hoping someone can help me debug why this implementation of linear regression does not work.

There is no problem when N=100, but with N=1000 it diverges to infinity.
import numpy as np

class Regression:
    def __init__(self, xs, ys, w, alpha):
        self.w = w
        self.xs = xs
        self.ys = ys
        self.a = alpha
        self.N = float(len(xs))

    def error(self, ys, yhat):
        return (1./self.N)*np.sum((ys-yhat)**2)

    def propagate(self):
        yhat = self.xs*self.w[0] + self.w[1]
        loss = yhat - self.ys
        r1 = (2./self.N)*np.sum(loss*self.xs)  # gradient w.r.t. the slope w[0]
        r2 = (2./self.N)*np.sum(loss)          # gradient w.r.t. the intercept w[1]
        self.w[0] -= self.a*r1
        self.w[1] -= self.a*r2

N = 600
xs = np.arange(0, N)
bias = np.random.sample(size=N)*10
ys = xs * 2. + 2. + bias
ws = np.array([0., 0.])
regressor = Regression(xs, ys, ws, 0.00001)
for i in range(1000):
    regressor.propagate()
Output:
...
2.71623180177e+286
5.27841816362e+286
1.02574818143e+287
1.99332318715e+287
3.87359919362e+287
7.52751526171e+287
1.46281231441e+288
2.84266426942e+288
5.52411274435e+288
1.07349369184e+289
2.0861064206e+289
4.05390365232e+289
7.87789858657e+289
1.5309018532e+290
2.97498179035e+290
5.78124367308e+290
1.12346161297e+291
2.18320843611e+291
4.24260074438e+291
8.2445912074e+291
1.6021607564e+292
3.11345829619e+292
6.05034327761e+292
1.17575539141e+293
2.28483026006e+293
4.4400811218e+293
8.62835227315e+293
As N grows, the gradient components r1 and r2 at the starting point w=[0,0] scale quadratically and linearly in N, respectively. For a large enough N, the initial step taken along w becomes more than twice the size of the error, so the correction overshoots and actually increases the error. This positive feedback makes w oscillate around the correct values with ever-increasing amplitude instead of converging.

If you shrink alpha by a factor of ten, you will find that N=1000 converges.
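To see that scaling concretely, here is a small sketch (my own, not from the original post) that evaluates r1 and r2 at the starting point w=[0,0] for several values of N, using the same data-generating setup as the question. The roughly quadratic growth of r1 is what eventually makes the step alpha*r1 overshoot:

import numpy as np

# Evaluate the first gradient step at w = [0, 0] for increasing N.
# r1 grows roughly like N**2 and r2 roughly like N, so a fixed alpha
# that is stable for small N overshoots once N is large enough.
for N in (100, 300, 600, 1000):
    xs = np.arange(0, N)
    ys = xs * 2. + 2. + np.random.sample(size=N) * 10
    loss = -ys                       # yhat = 0 at w = [0, 0], so loss = yhat - ys = -ys
    r1 = (2. / N) * np.sum(loss * xs)
    r2 = (2. / N) * np.sum(loss)
    print(N, r1, r2)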
You have exceeded the radius of convergence of your method. I added a print statement at the bottom of propagate to trace the effect:
self.w = np.array(res).astype(np.float)
print self.error(ys, yhat), '\t', r1, '\t', r2, '\t', self.w
As K.A. Buhr points out, r1 scales quadratically with N. Choose your learning rate according to your input; it is not a constant that the SGD algorithm guarantees. Here is the output of the first 20 iterations for N=600, as in your code:
486826.997899 -482786.592791 -1211.52883528 [ 4.82786593 0.01211529]
946024.542374 673013.376697 1680.38708612 [-1.90226784 -0.00468858]
1838377.19732 -938192.956012 -2350.99664804 [ 7.47966172 0.01882138]
3572474.5816 1307858.19046 3268.82617841 [-5.59892018 -0.01386688]
6942323.62211 -1823178.2573 -4565.30975898 [ 12.63286239 0.03178622]
13490907.7204 2541543.91414 6355.61930844 [-12.78257675 -0.03176997]
26216686.5837 -3542958.75828 -8868.35584965 [ 22.64701083 0.05691359]
50946528.2176 4938949.44036 12354.1444796 [-26.74248357 -0.06662786]
99003709.9274 -6884985.98436 -17230.4097511 [ 42.10737627 0.10567624]
192392610.191 9597796.6223 24011.0009034 [-53.87058995 -0.13443377]
373874053.385 -13379504.31 -33480.2810842 [ 79.92445315 0.20036904]
726544597.0 18651274.1534 46663.6193386 [-106.58828839 -0.26626715]
1411884707.51 -26000217.8559 -65058.4461128 [ 153.41389017 0.38431731]
2743697288.89 36244780.0586 90684.1600127 [-209.03391041 -0.52252429]
5331791469.79 -50525887.4157 -126423.886221 [ 296.22496374 0.74171457]
10361201450.4 70434012.7562 176228.707876 [-408.11516382 -1.02057251]
20134788880.2 -98186304.1721 -245674.553107 [ 573.7478779 1.43617302]
39127675046.8 136873506.894 342466.322375 [-794.98719104 -1.9884902 ]
76036305324.8 -190804176.229 -477412.833248 [ 1113.05457125 2.78563813]
147760369643.0 265984517.38 665513.730619 [-1546.79060255 -3.86949918]
However, if you set alpha to 1e-6 (instead of 1e-5), the first rows are:
14495.6359775 -13788.3126768 -211.542964687 [ 0.01378831 0.00021154]
14306.0982004 -13697.7438847 -210.177498646 [ 0.02748606 0.00042172]
14119.0422005 -13607.7699931 -208.821001646 [ 0.04109383 0.00063054]
13934.4354818 -13518.3870942 -207.473414775 [ 0.05461221 0.00083801]
13752.2459738 -13429.5913063 -206.134679506 [ 0.0680418 0.00104415]
13572.4420258 -13341.3787729 -204.804737697 [ 0.08138318 0.00124895]
13394.9924018 -13253.7456628 -203.483531589 [ 0.09463693 0.00145244]
13219.8662747 -13166.6881702 -202.171003801 [ 0.10780362 0.00165461]
13047.0332208 -13080.202514 -200.867097331 [ 0.12088382 0.00185548]
12876.4632151 -12994.2849383 -199.571755548 [ 0.13387811 0.00205505]
12708.1266257 -12908.9317115 -198.284922195 [ 0.14678704 0.00225333]
... and it keeps converging. Incidentally, even at N=600, 1000 iterations are not enough to converge properly; you probably want to stop on an epsilon tolerance rather than a fixed iteration count.
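As a rough illustration of that epsilon idea (a sketch of my own, not code from the original answer), the training loop can stop once the mean squared error stops improving by more than a small tolerance, rather than after a fixed number of iterations. With alpha small enough for the given N (for example 1e-6 at N=600), something like this runs until convergence:

# Hypothetical stopping rule: iterate until the improvement in the
# mean squared error falls below `eps`, with a cap on total iterations.
eps = 1e-9
max_iters = 200000
prev_err = float('inf')
for i in range(max_iters):
    regressor.propagate()
    yhat = regressor.xs * regressor.w[0] + regressor.w[1]
    err = regressor.error(regressor.ys, yhat)
    if prev_err - err < eps:
        break
    prev_err = err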