Backpropagation implementation - how to apply the chain rule on matrices

Question

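Gradient `dL/dX` using the chain rule

Suppose L is the loss of the neural network, X is the input, and Y is the output of the dot product, Y = X•W = np.dot(X, W).

According to the chain rule, dL/dX → dY/dX • dL/dY → W.T • dL/dY, because dY/dX = W.T for the product Y = X•W.

Question 1

How can the chain-rule formula dL/dX → W.T • dL/dY be applied to matrices? W.T is (4, 3) while dL/dY is (4,). The shapes do not match, so applying the formula naively does not work.

What idea, principle, or transformation can I apply to overcome this? I think matrices need a different approach.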

        # gradient dy (dL/dY) back-propagated from the posterior layer
        dy = self.posterior.backward()    

        # Apply chain-rule dL/dX = dY/dX @ dL/dY where dY/dX = W.T
        dx = np.dot(self.w.T, dy)

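Note: there are typos in the diagram. (,4) should read (4,), and so on. In my head a 1-D array of 4 elements is (,4), but in NumPy it is (4,).

Question 2

What is the rationale for having to transpose W and X (W.T and X.T) to make the chain rule work? If I transposed dL/dY instead, I think I could use W without transposing it, but please help me understand.

For `dL/dX`

I saw an answer that swaps the positions, but I do not know where that comes from or why it is valid. Why is it fine to change the order of the factors in the chain rule?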

        # dL/dX = dL/dY • W.T instead of W.T • dL/dY 
        dx = np.dot(dy, self.w.T)   # dy(4,) @ w.T(4, 3) -> (3,)
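
One way to check the ordering is to compare both products against a finite-difference gradient. The sketch below is only an illustration under assumptions that are not in the original post: W has shape (3, 4), x is a single input of shape (3,), and the loss is a made-up linear function L = sum(Y * c), chosen so that dL/dY is exactly the constant vector c.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # single input, shape (3,)
W = rng.normal(size=(3, 4))   # weights, shape (3, 4)
c = rng.normal(size=4)        # assumed upstream gradient dL/dY, shape (4,)

def loss(x):
    # Toy loss (an assumption) chosen so that dL/dY is exactly c
    return np.sum(np.dot(x, W) * c)

dy = c
dx = np.dot(dy, W.T)          # (4,) @ (4, 3) -> (3,)

# Central finite differences as an independent reference
eps = 1e-6
dx_num = np.array([
    (loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

shapes_match = dx.shape == x.shape
agrees = np.allclose(dx, dx_num, atol=1e-6)

# For a 1-D dy, multiplying by W from the left gives the same numbers
same_as_W_dy = np.allclose(np.dot(W, dy), dx)

# The naive ordering fails on shapes alone: (4, 3) and (4,) are not aligned
try:
    np.dot(W.T, dy)
    naive_fails = False
except ValueError:
    naive_fails = True
```

With a 1-D dy, `np.dot(dy, W.T)` and `np.dot(W, dy)` compute the same sums; the order only starts to matter once dy is a 2-D batch.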

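For `dL/dW`

In the figure below from the answer, the shapes of X.T (,3) and dL/dY (,4) are converted to (3, 1) and (1, 4) so that the shapes match (they are actually (2, 1) and (1, 3) in the figure, but I use these sizes to stay consistent with the snapshot above). I am not sure where this conversion comes from or what the rationale behind it is.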
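
The reshaping to (3, 1) and (1, 4) can be read as an outer product: since Y[j] = sum_i x[i] * W[i, j], each entry of the gradient is dL/dW[i, j] = x[i] * dL/dY[j]. The following sketch checks this numerically; the sizes, the random values, and the toy loss L = sum(Y * dy) are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)        # one input record, shape (3,)
W = rng.normal(size=(3, 4))   # weights, shape (3, 4)
dy = rng.normal(size=4)       # assumed upstream gradient dL/dY, shape (4,)

# dL/dW[i, j] = x[i] * dL/dY[j], i.e. an outer product
dW_outer = np.outer(x, dy)                        # (3, 4)
dW_matmul = x.reshape(3, 1) @ dy.reshape(1, 4)    # (3, 1) @ (1, 4) -> (3, 4)

def loss(W):
    # Toy loss (an assumption) whose gradient with respect to Y is dy
    return np.sum(np.dot(x, W) * dy)

# Central finite differences, one weight at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(3):
    for j in range(4):
        step = np.zeros_like(W)
        step[i, j] = eps
        dW_num[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)
```

For a batch, `np.dot(X.T, dout)` sums these per-record outer products over the batch dimension.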

Answer

deep-learning-from-scratch/common/layers.py

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        
        dx = dx.reshape(*self.original_x_shape)  # restore the shape of the input data (tensor support)
        return dx
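
The three formulas in this `backward` can be checked against numerical gradients. The sketch below restates them outside the class for a small batch; the shapes, the random data, the bias `b`, the helper `numerical_grad`, and the toy loss L = sum((X@W + b) * dout) are all assumptions for the illustration, not part of the referenced repository.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))      # batch of 5 records with 3 features
W = rng.normal(size=(3, 4))      # weights
b = rng.normal(size=4)           # bias
dout = rng.normal(size=(5, 4))   # assumed upstream gradient dL/dY

def loss(X, W, b):
    # Toy loss (an assumption) whose gradient with respect to Y is dout
    return np.sum((np.dot(X, W) + b) * dout)

# Same formulas as in backward()
dx = np.dot(dout, W.T)           # (5, 4) @ (4, 3) -> (5, 3)
dW = np.dot(X.T, dout)           # (3, 5) @ (5, 4) -> (3, 4)
db = np.sum(dout, axis=0)        # (4,)

def numerical_grad(f, a, eps=1e-6):
    """Central-difference gradient of the scalar f with respect to array a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        step = np.zeros_like(a)
        step[idx] = eps
        grad[idx] = (f(a + step) - f(a - step)) / (2 * eps)
    return grad

dx_num = numerical_grad(lambda a: loss(a, W, b), X)
dW_num = numerical_grad(lambda a: loss(X, a, b), W)
db_num = numerical_grad(lambda a: loss(X, W, a), b)
```

Each gradient has the same shape as the variable it belongs to, which is a quick sanity check on the order of the factors and on where the transposes go.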

Code

Work in progress: untested, and it will not work as is.

from typing import List

import numpy as np


class Affine(object):
    """Affine (MatMul) Layer"""
    def __init__(self, units, weights, optimizer, posteriors: List[object]):
        """Initialize the affine layer.
        
        [X] shape(m, n), where m is the batch size
        Aka Batch. An array of input data x with n features (n: 0, 1, ..., n). n=0 is a bias.
        j-th input X[j] is [x(j)(0), x(j)(1), ... x(j)(n)] where bias 'x(j)(0)' is 1.
        Use capital X for batch and x for its individual input.
        
        NOTE: "input" is not limited to the first input data layer e.g. image pixels, 
              but "input" at any layer.

        [weights] shape(n, units)
        k-th neuron (k: 0, 1, .. units-1) has its weight vector W(k):[w(k)(0), w(k)(1), ... w(k)(n)].
        w(k)(0) is its bias weight. Each w(k)(i) amplifies i-th feature in the input x.  
                
        Args:
            units: number of neurons in the layer
            weights: array of weight-vectors of each neuron. shape(n, units)
            optimizer: gradient descent implementation e.g SGD, Adam.
            posteriors: next layers
        """
        # neuron weight vectors
        self.w: np.ndarray = weights     # weight vector per neuron
        self.n: int = weights.shape[0]   # number of features expected
        self.dw: np.ndarray = None       # gradient of W

        self.X: np.ndarray = np.empty((0, self.n))   # Batch input (shape must be a tuple)
        self.m: int = -1                 # batch size: X.shape[0]

        self.optimizer = optimizer       # used in backward() to update W
        self.posterior = posteriors[0]
        
        
    def forward(self, X):
        """Forward propagation of the affine layer X@W"""
        # X@W from X(m, n) @ W(n, units) to generate output Y(m, units)
        self.X = X                  # keep the batch for the backward pass
        self.m = X.shape[0]         # batch size
        Y = np.dot(self.X, self.w)
        self.posterior.forward(Y)


    def backward(self):
        # --------------------------------------------------------------------------------
        # Back propagation dy from the posterior layer. dy shape must match that of Y(m, units)
        # --------------------------------------------------------------------------------
        dy = self.posterior.backward()    # gradient back-propagated from the posterior 
        assert dy.shape == (self.m, self.w.shape[1]), \
            "gradient dy shape {} must match output Y shape ({}, {})".format(
                dy.shape, self.m, self.w.shape[1]
            )

        # --------------------------------------------------------------------------------
        # Gradient descent on W
        # --------------------------------------------------------------------------------
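        dw = np.dot(self.X.T, dy)        # X.T(n, m) @ dy(m, units) -> (n, units)
        self.dw = dw

        # dL/dX must use the weights from the forward pass, so compute it before updating W
        dx = np.dot(dy, self.w.T)        # dy(m, units) @ W.T(units, n) -> (m, n)

        # NOTE: the optimizer method name is missing in the original; `update` is an assumed name
        self.w = self.optimizer.update(self.w, dw)
        return dx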

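Geometry

As I understand it, X•W extracts the part of X that lies along the dimensions of W by geometrically truncating the other dimensions of X. If so, are dL/dX and dL/dW restoring the truncated dimensions? I am not sure this is correct, but if it is, can it be visualized as a projection of X•W in a diagram?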

Answer 1

Thanks to @Reti43 for pointing to the reference. The detailed math is provided by Justin Johnson of cs231n (now at the University of Michigan) at http://cs231n.stanford.edu/handouts/linear-backprop.pdf, which is also available as Backpropagation for a Linear Layer.

cs231n lecture 4 explains the idea.

The math from step (5) to step (6) looked like a leap to me, because a dot product cannot be taken between two 2-D matrices, and for 2-D inputs numpy.dot performs matrix multiplication like np.matmul, so it would not be a dot product.

The answer in "numpy function to use for mathematical dot product to produce scalar" resolved that question.
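
To make the scalar "dot product" of two 2-D arrays concrete, the sketch below treats it as the sum of elementwise products and uses it to build dL/dX one entry at a time, then compares the result with the matrix form dL/dY @ W.T. The sizes N, D, M and the random data are assumptions for the illustration, and the step-by-step loop is my reading of the handout's argument rather than a quotation of it.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, M = 2, 3, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
dY = rng.normal(size=(N, M))   # assumed upstream gradient dL/dY

dX = np.zeros_like(X)
for i in range(N):
    for j in range(D):
        # dY/dX[i, j]: only row i of Y depends on X[i, j], with slope W[j, :]
        dY_dxij = np.zeros((N, M))
        dY_dxij[i, :] = W[j, :]
        # Scalar "dot product" of two 2-D arrays: sum of elementwise products
        dX[i, j] = np.sum(dY * dY_dxij)

# Equivalent ways to get such a scalar from two same-shaped 2-D arrays
A, B = dY, np.ones((N, M))
s1 = np.sum(A * B)
s2 = np.vdot(A, B)                # flattens both arguments first
s3 = np.tensordot(A, B, axes=2)   # contracts over both axes
```

Collecting the N*D scalars into one array gives exactly `dY @ W.T`, which is how the matrix product appears even though each individual entry is a scalar contraction.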

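My understanding from reading Justin Johnson's paper is as follows.

Format of the weight vector W

Note how Justin Johnson represents the weights W.

In the Coursera ML course, Andrew Ng uses a row vector to hold the weights of one node. When the number of features input to the layer is n, the row vector has size n.

Justin Johnson uses a row vector whose length is the layer size, i.e. the number of nodes in the layer. So if the layer has m nodes, the row vector has size m.

Hence Andrew Ng's weight matrix is m x n, meaning m rows of weight vectors, where each row holds the weights of the n features for one particular node. Justin Johnson's weight matrix is n x m, meaning n rows of weight vectors, where each row holds the weights of the m nodes for one particular feature.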
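
The two layouts describe the same layer and differ only in where the transpose sits. The sketch below puts them side by side under assumed sizes (batch size N, n features, m nodes); the names `W_node` and `W_feat` are hypothetical labels for the m x n and n x m layouts.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, m = 5, 3, 4                  # batch size, features, nodes (assumed sizes)
X = rng.normal(size=(N, n))
dY = rng.normal(size=(N, m))       # assumed upstream gradient dL/dY

W_node = rng.normal(size=(m, n))   # one row per node (m x n layout)
W_feat = W_node.T                  # one row per feature (n x m layout)

# Forward pass in each layout
Y_node = X @ W_node.T              # (N, n) @ (n, m) -> (N, m)
Y_feat = X @ W_feat                # (N, n) @ (n, m) -> (N, m)

# Backward pass in each layout
dX_node = dY @ W_node              # (N, m) @ (m, n) -> (N, n)
dX_feat = dY @ W_feat.T            # (N, m) @ (m, n) -> (N, n)
dW_node = dY.T @ X                 # (m, N) @ (N, n) -> (m, n)
dW_feat = X.T @ dY                 # (n, N) @ (N, m) -> (n, m)
```

Formulas copied from a source that uses one layout need their transposes moved before they can be applied to weights stored in the other layout.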

I suppose Justin Johnson treats the layer as a function, whereas Andrew Ng treats the node as a function.

Because I took Andrew Ng's ML course first, I used the weight-vector-per-node approach, which gives W as an m x n matrix. My confusion came from applying W = m x n to Justin Johnson's paper.

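Dimension analysis

First, frame the dimensions/shapes of the gradients.

Deriving the gradients

Using a simple single input record X of shape (d,), derive dL/dX, then extend it to a two-dimensional input X of shape (n, d), which gives W.T @ dL/dY.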
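
The single-record step can be written out index by index. The sketch below assumes the weight-vector-per-node layout, with W of shape (m, d) and the record x treated as a column vector so that y = W x; the index names i and k and the batch form at the end are my own notation, not taken from the handout.

```latex
y_k = \sum_{i=1}^{d} W_{ki}\, x_i
\qquad\Rightarrow\qquad
\frac{\partial y_k}{\partial x_i} = W_{ki}

\frac{\partial L}{\partial x_i}
  = \sum_{k=1}^{m} \frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial x_i}
  = \sum_{k=1}^{m} W_{ki}\,\frac{\partial L}{\partial y_k}
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial x} = W^{\top}\,\frac{\partial L}{\partial y}

% Batch of n records stacked as rows, X of shape (n, d):
Y = X W^{\top}
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W
```

Each row of the batch result is the transpose of W.T @ dL/dY for that record, which is why the order of the factors flips once records are stored as rows.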

dL/dX

Using row-order matrices for X and W, the result differs from cs231n, because the weights W are organized differently.

dL/dW.T

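Using row-order matrices for X and W, the result differs from cs231n, because the weights W are organized differently.

If anything here is incorrect, any feedback is much appreciated.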
