是什么决定了我的 Python 梯度下降算法是否收敛？

Question

我在 Python 中实现了一个单变量线性回归模型，它使用梯度下降来找到最佳拟合线的截距和斜率（我使用梯度下降而不是计算最优直接截距和斜率的值，因为我最终想推广到多元回归）。

我使用的数据如下。 sales 是因变量（美元），temp 是自变量（摄氏度）（想想冰淇淋销量与温度，或类似的东西）。

sales   temp
215     14.20
325     16.40
185     11.90
332     15.20
406     18.50
522     22.10
412     19.40
614     25.10
544     23.40
421     18.10
445     22.60
408     17.20

这是我标准化后的数据：

sales        temp 
0.06993007  0.174242424
0.326340326 0.340909091
0           0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1           1
0.836829837 0.871212121
0.55011655  0.46969697
0.606060606 0.810606061
0.51981352  0.401515152

我的算法代码：

import numpy as np
import pandas as pd
from scipy import stats

class SLRegression(object):
    def __init__(self, learnrate = .01, tolerance = .000000001, max_iter = 10000):

        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data   :   array-like, shape = [m_observations, 2_columns] 

        # Initialize local variables.
        converged = False
        m = data.shape[0]

        # Track number of iterations.
        self.iter_ = 0

        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0

        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)

        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])

            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)

            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1

            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s') % (new_J)

            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!') % (self.iter_)

            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1

            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')

        return self

    def point_forecast(self, x):
        # Given feature value x, returns the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))

    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()

    # Call the fit function and pass in the data.
    slregression.fit(data)

    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.') % (slregression.iter_, slregression.theta0_, slregression.theta1_)
    # Compare our model to scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:,1], data[:,0])
    print('Scipy linear regression gives intercept: %s and slope = %s.') % (intercept, slope)

    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.') % (slregression.point_forecast(.87)) # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')

我无法准确理解是什么让算法收敛，而 return 值是完全错误的。给定 learnrate = .01、tolerance = .0000000001 和 max_iter = 10000，结合归一化数据，我可以使梯度下降算法收敛。但是，当我使用未归一化的数据时，没有算法 returning NaN 的最小学习率是 .005。这将成本函数从迭代到迭代的变化降低到 614 左右，但我无法让它降低。

这种类型的算法是否确实需要规范化数据，如果是，为什么？此外，考虑到算法需要标准化值，采用非标准化形式的小说 x-value 并将其插入点预测的最佳方法是什么？例如，如果我要将这个算法提供给客户，这样他们就可以做出自己的预测（我不是，但为了争论..），我难道不希望他们能够简单地插入在非标准化 x-value?

总而言之，玩弄 tolerance、max_iter 和 learnrate 大多数时候都会给我不收敛的结果。这是正常现象，还是我的算法中存在导致此问题的缺陷？

Answer 1

Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005

您设置算法的方式是可以预料到的。

数据的归一化使得最佳拟合的 y 截距为大约 0.0。否则，您的 y 轴截距可能比初始猜测大出数千个单位，并且在您真正开始优化部分之前必须跋涉到那里。

Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?

不，绝对不是，但如果你不归一化，你应该更聪明地选择一个起点（你从 (m,b) = (0,0) 开始）。如果您不对数据进行归一化，您的学习率也可能太小，这与您的容忍度相同。

Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?

应用您对原始数据应用的任何转换，将规范化数据应用于您的新 x 值。（规范化代码不在您所展示的范围内）。如果这个测试点落在原始数据的 (minx,maxx) 范围内，一旦转换，它应该落在 0 <= x <= 1 内。一旦你有了这个标准化测试点，将它插入你的 theta 方程（请记住，您的 theta 是直线方程的 y 截距形式的 m,b）。

All and all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.

对于一个格式良好的问题，如果你实际上发散，这通常意味着你的步长太大。尝试降低它。

如果它在达到最大迭代次数之前根本没有收敛，那可能是一些问题：

你的步长太小了，
你的容忍度太小了，
您的最大迭代次数太小，
你的出发点选错了

在您的情况下，使用非标准化数据会导致 (0,0) 的起点非常遥远（非标准化数据的 (m,b) 在 (-159, 30) 附近而标准化数据的 (m,b) 是 (0.10,0.79))，所以大多数（如果不是全部）迭代都用于到达感兴趣的区域。

这样做的问题是，通过增加步长以更快地到达感兴趣的区域也使得它更少-到达那里后找到收敛的可能性。

为了解决这个问题，一些梯度下降算法具有动态步长（或学习率），以便在开始时采用较大的步长，并在接近收敛时采用较小的步长。

在整个算法中保留 theta 对的历史记录，然后绘制它们可能对您有所帮助。您将能够立即看到使用规范化和非规范化输入数据之间的区别。

是什么决定了我的 Python 梯度下降算法是否收敛？

What determines whether my Python gradient descent algorithm converges?

python

algorithm

linear-regression

gradient-descent

convergence