Gradients of loss with respect to random parameters are the same in pytorch

Question

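In the simple code below, I apply a linear operation to an input tensor of ones and compute the binary cross-entropy loss against a zero vector as the expected output. When I compute the gradient of the loss with respect to w, all of its rows are identical and equal to the gradient with respect to b. This is counterintuitive, since w and b hold random values. What is the reason?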

import torch

n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output)  # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x, w) + b  # logits, shape (n_output,)
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward()
print(w.grad)
print(b.grad)

Output:

tensor([[0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959],
        [0.2179, 0.4337, 0.1959]])
tensor([0.2179, 0.4337, 0.1959])

Answer 1

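You have a single data point with five input features. Looking at the operations you perform: you compute z = x@w + b, then apply binary cross-entropy with logits against an all-zero target. Binary cross-entropy is defined as: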

bce = -[y_true*log(σ(y_pred)) + (1 - y_true)*log(1 - σ(y_pred))]

The gradient with respect to z, written as the partial derivative dL/dz, has three components (the same size as z); call them [dz1, dz2, dz3].
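As a rough check of what dL/dz looks like, the sketch below compares autograd's result with the closed form (sigmoid(z) - y) / n. It assumes the default reduction='mean' of binary_cross_entropy_with_logits, which is where the division by n comes from; the seed and the variable names dL_dz and analytic are illustrative only.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
n_output = 3
z = torch.randn(n_output, requires_grad=True)  # stand-in logits
y = torch.zeros(n_output)                      # all-zero target

loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# Gradient of the loss with respect to the logits, via autograd.
(dL_dz,) = torch.autograd.grad(loss, z)

# Closed form under the assumed mean reduction: (sigmoid(z) - y) / n.
analytic = (torch.sigmoid(z.detach()) - y) / n_output

print(dL_dz)
print(analytic)
```

Both printed tensors should agree up to floating-point error. Note that dL/dz depends on w and b only through z.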

To compute the gradients of the weight parameter w and the bias parameter b, we have:

dL/dw = x.T @ dL/dz
dL/db = dL/dz (with a shape change)

So b.grad is simply

[dz1, dz2, dz3]

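And since x consists entirely of ones, x.T @ dL/dz (the outer product of x and dL/dz) becomes a matrix in which every row equals dL/dz, five rows in total: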

[[dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3],
 [dz1, dz2, dz3]]
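The relations above can be checked numerically. This is a minimal sketch, assuming the same setup as the question; calling retain_grad() on z is one way to expose dL/dz on a non-leaf tensor, and the seed and the name dL_dz are illustrative.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output)
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)

z = torch.matmul(x, w) + b
z.retain_grad()  # keep the gradient of the intermediate tensor z
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

dL_dz = z.grad

# dL/db equals dL/dz, and dL/dw is the outer product of x and dL/dz.
print(torch.allclose(b.grad, dL_dz))
print(torch.allclose(w.grad, torch.outer(x, dL_dz)))

# With x all ones, every row of w.grad is a copy of b.grad.
print(torch.allclose(w.grad, b.grad.expand(n_input, n_output)))
```

The random values of w and b change the numbers in dL/dz, but not the fact that each row of w.grad is a copy of it.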

Answer 2

Because your input is symmetric.

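Think of the problem in terms of perceptrons (there are 3 in your setup): every input is 1.0, so for a given neuron it does not matter which input a weight is attached to, because the input is 1.0 everywhere. All weights feeding the same neuron therefore receive the same gradient.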

If you vary the inputs, everything behaves as expected:

    import torch

    n_input, n_output = 5, 3
    x = torch.randn(n_input)
    y = torch.ones(n_output)/2.  # expected output
    w = torch.randn(n_input, n_output, requires_grad=True)
    b = torch.randn(n_output, requires_grad=True)
    z = torch.matmul(x, w) + b

    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    loss.backward()
    print(w.grad)
    print(b.grad)

    tensor([[-0.1939,  0.1657, -0.2501],
            [ 0.0561, -0.0480,  0.0724],
            [-0.3162,  0.2703, -0.4079],
            [ 0.0947, -0.0809,  0.1221],
            [-0.0140,  0.0120, -0.0181]])
    tensor([-0.1263,  0.1080, -0.1630])
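Even with varied input, the rows of w.grad are still tied to b.grad: row i is x[i] times b.grad. In the output above, for instance, the first row is roughly 1.53 times b.grad in every column. The sketch below checks this with a hand-picked, non-uniform x; the specific values of x, the seed, and the name rows_match are illustrative assumptions.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
n_input, n_output = 5, 3
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # non-uniform input
y = torch.ones(n_output) / 2.0               # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Each row of w.grad should be b.grad scaled by the matching input value.
rows_match = all(
    torch.allclose(w.grad[i], x[i] * b.grad) for i in range(n_input)
)
print(rows_match)
```

Setting every x[i] to 1.0 makes all the scale factors equal, which is exactly the identical-rows case from the question.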


backpropagation

pytorch