What is the error in the iterative implementation of the gradient descent algorithm below?

Question

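I tried to implement an iterative version of the gradient descent algorithm, but it does not work correctly. The vectorized implementation of the same algorithm, however, works fine.
Here is the iterative implementation: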

function [theta] = gradientDescent_i(X, y, theta, alpha, iterations)

    % get the number of rows and columns
    nrows = size(X, 1);
    ncols = size(X, 2);

    % initialize the hypothesis vector
    h = zeros(nrows, 1);

    % initialize the temporary theta vector
    theta_temp = zeros(ncols, 1);

    % run gradient descent for the specified number of iterations
    count = 1;

    while count <= iterations

        % calculate the hypothesis values and fill into the vector
        for i = 1 : nrows
            for j = 1 : ncols
                term = theta(j) * X(i, j);
                h(i) = h(i) + term;
            end
        end

        % calculate the gradient
        for j = 1 : ncols
            for i = 1 : nrows
                term = (h(i) - y(i)) * X(i, j);
                theta_temp(j) = theta_temp(j) + term;
            end
        end

        % update the gradient with the factor
        fact = alpha / nrows;

        for i = 1 : ncols
            theta_temp(i) = fact * theta_temp(i);
        end

        % update the theta
        for i = 1 : ncols
            theta(i) = theta(i) - theta_temp(i);
        end

        % update the count
        count += 1;
    end
end

Here is the vectorized implementation of the same algorithm:

function [theta, theta_all, J_cost] = gradientDescent(X, y, theta, alpha)

    % set the learning rate
    learn_rate = alpha;

    % set the number of iterations
    n = 1500;

    % number of training examples
    m = length(y);

    % initialize the theta_new vector
    l = length(theta);
    theta_new = zeros(l,1);

    % initialize the cost vector
    J_cost = zeros(n,1);

    % initialize the vector to store all the calculated theta values
    theta_all = zeros(n,2);

    % perform gradient descent for the specified number of iterations
    for i = 1 : n

        % calculate the hypothesis
        hypothesis = X * theta;

        % calculate the error
        err = hypothesis - y;

        % calculate the gradient
        grad = X' * err;

        % calculate the new theta
        theta_new = (learn_rate/m) .* grad;

        % update the old theta
        theta = theta - theta_new;

        % update the cost
        J_cost(i) = computeCost(X, y, theta);

        % store the calculated theta value
        if i < n
            index = i + 1;
            theta_all(index,:) = theta';
        end
    end
end

The dataset can be found here.

The file is named ex1data1.txt.

Problem

With the initial theta = [0, 0] (this is a vector!), a learning rate of 0.01, and 1500 iterations, I get the optimal theta as:

theta0 = -3.6303
theta1 = 1.1664

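The above is the output of the vectorized implementation, which I know is correct (it passed all the test cases on Coursera).

However, when I run the same algorithm with the iterative method (the first code listed above), the theta values I get are (alpha = 0.01, iterations = 1500):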

theta0 = -0.20720
theta1 = -0.77392

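This implementation fails the test cases, so I know it is incorrect.

However, I cannot work out where I went wrong. The iterative code does the same job and performs the same multiplications as the vectorized code, and when I traced one iteration of both codes with pen and paper, the values came out the same. Yet the results differ when I run them in Octave.

Any help would be much appreciated, especially if you can point out where I went wrong and the exact reason for the failure.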

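Points to consider

The implementation of the hypothesis is correct: I tested it, and both codes gave the same result, so there is no problem there.
I printed the gradient vector in both codes and realized that the error lies here, because the outputs are very different!

In addition, here is the code that preprocesses the data: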

function[X, y] = fileReader(filename)

    % load the dataset
    dataset = load(filename);

    % get the dimensions of the dataset
    nrows = size(dataset, 1);
    ncols = size(dataset, 2);

    % generate the X matrix from the dataset
    X = dataset(:, 1 : ncols - 1);

    % generate the y vector
    y = dataset(:, ncols);

    % append 1's to the X matrix
    X = [ones(nrows, 1), X];
end

Answer 1

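The problem with the first code is that the theta_temp and h vectors are not initialized correctly. For the first iteration (when count equals 1), your code works properly, because for that iteration h and theta_temp are correctly initialized to 0. However, these are temporary vectors for each iteration of gradient descent, and they are not re-initialized to zero vectors in subsequent iterations. That is, in iteration 2, the new values for h(i) and theta_temp(i) are simply added onto the old values. Hence the code does not work correctly. You need to reset both vectors to zero at the start of every iteration, and then it works. Here is my version of your code (the first one; note the changes):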

function [theta] = gradientDescent_i(X, y, theta, alpha, iterations)

    % get the number of rows and columns
    nrows = size(X, 1);
    ncols = size(X, 2);

    % run gradient descent for the specified number of iterations
    count = 1;

    while count <= iterations

        % initialize the hypothesis vector
        h = zeros(nrows, 1);

        % initialize the temporary theta vector
        theta_temp = zeros(ncols, 1);


        % calculate the hypothesis values and fill into the vector
        for i = 1 : nrows
            for j = 1 : ncols
                term = theta(j) * X(i, j);
                h(i) = h(i) + term;
            end
        end

        % calculate the gradient
        for j = 1 : ncols
            for i = 1 : nrows
                term = (h(i) - y(i)) * X(i, j);
                theta_temp(j) = theta_temp(j) + term;
            end
        end

        % update the gradient with the factor
        fact = alpha / nrows;

        for i = 1 : ncols
            theta_temp(i) = fact * theta_temp(i);
        end

        % update the theta
        for i = 1 : ncols
            theta(i) = theta(i) - theta_temp(i);
        end

        % update the count
        count += 1;
    end
end
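One way to check a fix like this without the course dataset is to run the loop-based update next to the vectorized update on a small example and compare the results. The following is only a sketch under stated assumptions: the 3-by-2 matrix X, the vector y, alpha = 0.1, and the 50 iterations are made-up values for illustration, not taken from ex1data1.txt, and the variable names theta_loop, theta_vec, grad, and max_diff are hypothetical.

```octave
% Made-up toy data (NOT from ex1data1.txt): a bias column plus one feature.
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
alpha = 0.1;          % assumed learning rate for this toy problem
iterations = 50;      % assumed iteration count
[nrows, ncols] = size(X);

theta_loop = zeros(ncols, 1);   % updated with explicit loops
theta_vec  = zeros(ncols, 1);   % updated with matrix operations

for count = 1:iterations
    % Both accumulators are re-created on every pass.
    h = zeros(nrows, 1);
    grad = zeros(ncols, 1);

    % Hypothesis: h(i) = sum over j of theta(j) * X(i, j)
    for i = 1:nrows
        for j = 1:ncols
            h(i) = h(i) + theta_loop(j) * X(i, j);
        end
    end

    % Gradient: grad(j) = sum over i of (h(i) - y(i)) * X(i, j)
    for j = 1:ncols
        for i = 1:nrows
            grad(j) = grad(j) + (h(i) - y(i)) * X(i, j);
        end
    end

    theta_loop = theta_loop - (alpha / nrows) * grad;

    % Vectorized update, same formula as in the question.
    theta_vec = theta_vec - (alpha / nrows) * (X' * (X * theta_vec - y));
end

% Largest element-wise gap between the two results.
max_diff = max(abs(theta_loop - theta_vec));
fprintf('max difference: %g\n', max_diff);
```

If the two updates really compute the same thing, the reported difference should be at the level of floating-point rounding. Moving the two zeros(...) lines above the outer loop should make the gap grow, which reproduces the original bug.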

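I ran the code and it gives the same theta values you mentioned. What I wonder, though, is how you concluded that the output of the hypothesis vector was the same in both cases, when that is clearly one of the reasons the first code failed!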
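A possible explanation for the matching hypothesis values: with theta starting at zero, a stale h still holds only zeros after the first pass, so the second pass also produces the right hypothesis, and only the gradient accumulator is already off. The sketch below illustrates this under assumptions: the toy X and y and alpha = 0.1 are made up, and the nested loops are abbreviated to vector additions that accumulate in the same way (h = h + X * theta stands in for the loop that keeps adding terms onto the old h). The names h_ok, g_ok, h_fresh, and g_fresh are hypothetical.

```octave
% Made-up toy data (NOT from ex1data1.txt).
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
theta = [0; 0];
alpha = 0.1;                  % assumed learning rate
m = size(X, 1);

% Accumulators created once and never reset, as in the original code.
h = zeros(m, 1);
theta_temp = zeros(2, 1);

h_ok = false(3, 1);           % does the stale h match X * theta?
g_ok = false(3, 1);           % does the stale gradient match the fresh one?

for count = 1:3
    % Shorthand for the nested loops: new terms are added onto old contents.
    h = h + X * theta;
    theta_temp = theta_temp + X' * (h - y);

    % Reference values computed from scratch for the same theta.
    h_fresh = X * theta;
    g_fresh = X' * (h_fresh - y);
    h_ok(count) = norm(h - h_fresh) < 1e-12;
    g_ok(count) = norm(theta_temp - g_fresh) < 1e-12;

    % Same scaling and update steps as in the original code.
    theta_temp = (alpha / m) * theta_temp;
    theta = theta - theta_temp;
end

disp([h_ok, g_ok]);           % column 1: hypothesis, column 2: gradient
```

On this toy input the hypothesis matches in passes 1 and 2 and diverges in pass 3, while the gradient matches only in pass 1. That would be consistent with the hypothesis looking correct in an early check even though the gradient output already differs.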
