Backpropagation (Coursera ML by Andrew Ng) gradient descent clarification
Question
Please forgive me for asking a question specific to the Coursera ML course; I hope someone who has taken it can answer.
In the Coursera ML Week 4 Multi-class Classification and Neural Networks assignment, why is the weight (theta) gradient accumulated by adding the derivative term rather than subtracting it?
% Calculate the gradients of Theta2
% Derivative of the loss function J w.r.t. the output:  dJ/dO = (oi - yi) / (oi * (1 - oi))
% Derivative of the sigmoid activation:                  dO/dZ = oi * (1 - oi)
delta_theta2 = oi - yi;   % <--- (dJ/dO) * (dO/dZ)
% Using +/plus, NOT -/minus
Theta2_grad = Theta2_grad + ...                    % <-------- Why plus (+)?
    bsxfun(@times, hi, transpose(delta_theta2));
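For reference, this is how that delta term falls out of the chain rule. The derivation below is my own reconstruction and assumes the usual cross-entropy loss with a sigmoid output unit, which is what the comments above suggest:

$$J = -\bigl[\,y \log o + (1 - y)\log(1 - o)\,\bigr], \qquad o = \sigma(z)$$
$$\frac{\partial J}{\partial o} = \frac{o - y}{o\,(1 - o)}, \qquad \frac{\partial o}{\partial z} = o\,(1 - o) \quad\Longrightarrow\quad \frac{\partial J}{\partial z} = o - y$$

which matches the delta_theta2 = oi - yi line in the snippet.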
Code excerpt
for i = 1:m
% i indexes the training examples in X (with the bias term). X(i, :) has 401 values.
xi = X(i, :);
yi = Y(i, :);
% hi is the i-th hidden-layer output. H(i, :) has 26 values (with the bias unit).
hi = H(i, :);
% oi is the i-th output-layer output. O(i, :) has 10 values.
oi = O(i, :);
%------------------------------------------------------------------------
% Calculate the gradients of Theta2
%------------------------------------------------------------------------
delta_theta2 = oi - yi;
Theta2_grad = Theta2_grad + bsxfun(@times, hi, transpose(delta_theta2));
%------------------------------------------------------------------------
% Calculate the gradients of Theta1
%------------------------------------------------------------------------
% Derivative of g(z): g'(z)=g(z)(1-g(z)) where g(z) is sigmoid(H_NET).
dgz = (hi .* (1 - hi));
delta_theta1 = dgz .* sum(bsxfun(@times, Theta2, transpose(delta_theta2)));
% There is no input into H0, hence there is no theta for H0. Remove H0.
delta_theta1 = delta_theta1(2:end);
Theta1_grad = Theta1_grad + bsxfun(@times, xi, transpose(delta_theta1));
end
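For what it's worth, the delta_theta1 line above is an element-wise form of the standard backpropagation recursion (again my own reading of the code, with Theta2 of size 10 x 26 as the comments imply):

$$\delta^{(1)} = \bigl(\Theta_2^{\top}\, \delta^{(2)}\bigr) \odot g'\!\bigl(z^{(1)}\bigr), \qquad g'(z) = g(z)\,\bigl(1 - g(z)\bigr)$$

after which the bias component is dropped, matching delta_theta1 = delta_theta1(2:end).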
I thought the derivative would be subtracted.
Because the overall gradient is the average of the per-example gradients, we first "accumulate" the gradient while looping over all training examples; we do that by summing the per-example gradients. So the line you highlighted with the plus sign is not the gradient-descent update step (note that alpha does not appear there either). That update happens somewhere else, most likely outside the loop from 1 to m.
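To make that concrete, here is a minimal sketch of what typically sits outside the i = 1:m loop. This is my own illustration with a hypothetical learning rate alpha, not the assignment's actual code (the assignment normally hands the descent step to a supplied optimizer such as fmincg):

% After the accumulation loop: turn the sums into averages over the m examples.
Theta1_grad = Theta1_grad / m;
Theta2_grad = Theta2_grad / m;

% The gradient-descent update itself -- this is where the minus sign (and alpha) appear.
Theta1 = Theta1 - alpha * Theta1_grad;
Theta2 = Theta2 - alpha * Theta2_grad;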
Also, I'm not sure at what point you learn this (I'm sure it's somewhere in the course), but you can also vectorize the code :)
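In case it helps, here is a rough vectorized equivalent of the accumulation loop, based on the sizes stated in the comments (X is m x 401, H is m x 26, O and Y are m x 10); it is a sketch, not the course's reference solution:

Delta2 = O - Y;                                 % m x 10, output-layer deltas for all examples
Theta2_grad = Delta2' * H;                      % 10 x 26, sums the per-example outer products
Delta1 = (Delta2 * Theta2) .* (H .* (1 - H));   % m x 26, backpropagate through the sigmoid
Delta1 = Delta1(:, 2:end);                      % drop the bias unit H0 (no theta feeds into it)
Theta1_grad = Delta1' * X;                      % 25 x 401, again summed over all m examples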