Linear Regression with gradient descent: two questions
I'm trying to understand linear regression with gradient descent, but I don't understand this part of the loss_gradients function below.
import numpy as np

def forward_linear_regression(X, y, weights):
    # dot product of inputs with weights
    N = np.dot(X, weights['W'])
    # add bias
    P = N + weights['B']
    # compute loss with MSE
    loss = np.mean(np.power(y - P, 2))
    # save everything computed on the forward pass for the backward pass
    forward_info = {}
    forward_info['X'] = X
    forward_info['N'] = N
    forward_info['P'] = P
    forward_info['y'] = y
    return loss, forward_info
Here is where I get stuck; I've added my questions as comments:
def loss_gradients(forward_info, weights):
    # to update the weights, we need: dLdW = dLdP * dPdN * dNdW
    dLdP = -2 * (forward_info['y'] - forward_info['P'])
    dPdN = np.ones_like(forward_info['N'])
    dNdW = np.transpose(forward_info['X'], (1, 0))
    dLdW = np.dot(dNdW, dLdP * dPdN)
    # why do we mix matrix multiplication and dot product like this?
    # why not dLdP * dPdN * dNdW instead?

    # to update the bias, we need: dLdB = dLdP * dPdB
    dPdB = np.ones_like(weights['B'])
    dLdB = np.sum(dLdP * dPdB, axis=0)
    # why do we sum those values along axis 0?
    # why not just dLdP * dPdB?

    # package the gradients for the update step
    return {'W': dLdW, 'B': dLdB}
It looks to me like this code expects 'batched' data. What I mean is that when you compute forward_info and the loss gradients, you are actually passing in a whole bunch of (X, y) pairs at once. Say you pass in B such pairs. Then the first dimension of everything stored in forward_info has size B.
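As a quick sanity check, here is a minimal sketch (my own toy example with made-up shapes and random data, assuming the forward_linear_regression function from the question) that calls the forward pass on a small batch and prints the shapes stored in forward_info:

import numpy as np

np.random.seed(0)
B, D = 4, 3                                  # hypothetical batch size and feature count
X = np.random.randn(B, D)                    # B examples, D features each
y = np.random.randn(B, 1)                    # one target per example
weights = {'W': np.random.randn(D, 1), 'B': np.random.randn(1, 1)}

loss, forward_info = forward_linear_regression(X, y, weights)
print(forward_info['X'].shape)               # (4, 3)
print(forward_info['N'].shape)               # (4, 1)
print(forward_info['P'].shape)               # (4, 1)
print(forward_info['y'].shape)               # (4, 1)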
Now, the answer to both of your questions is the same: essentially, these lines compute the gradient (using the formula you expected) for each of the B items and then sum all of those gradients up, so that you end up with a single gradient update. I encourage you to work through the logic behind the dot product yourself, because it is a very common pattern in ML, but it can be a bit tricky to grasp at first.
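To make that concrete, here is a small check I added (continuing the toy batch above, not part of the original answer): it computes the gradient one example at a time with the per-example formula, sums them up, and compares the result with the batched expressions used in loss_gradients:

# per-example gradients, accumulated by hand
dLdP = -2 * (forward_info['y'] - forward_info['P'])        # shape (B, 1)
dLdW_manual = np.zeros_like(weights['W'])                  # shape (D, 1)
dLdB_manual = np.zeros_like(weights['B'])                  # shape (1, 1)
for i in range(B):
    x_i = forward_info['X'][i:i+1]                         # i-th example, shape (1, D)
    dLdW_manual += x_i.T * dLdP[i]                         # this example's weight gradient
    dLdB_manual += dLdP[i]                                 # this example's bias gradient

# batched versions, exactly as written in loss_gradients
dLdW_batched = np.dot(np.transpose(forward_info['X'], (1, 0)), dLdP)
dLdB_batched = np.sum(dLdP, axis=0)

print(np.allclose(dLdW_manual, dLdW_batched))              # True
print(np.allclose(dLdB_manual, dLdB_batched))              # True

The np.dot with the transposed X is just a compact way of doing that loop-and-sum in one step, and the axis=0 sum plays the same role for the bias gradient.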