Gradients vanishing despite using Kaiming initialization

Question

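I'm implementing a conv block with an activation function (PReLU) in PyTorch. I use Kaiming initialization for all my weights and set all the biases to zero. However, when I tested these blocks (by stacking 100 such conv-and-activation blocks on top of each other), I noticed that the output values are on the order of 10^(-10). Is this normal, considering that I am stacking up to 100 layers? Adding a small bias to each layer fixes the problem, but in Kaiming initialization the biases are supposed to be zero.

Here is the conv block code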

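from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10

import numpy as np
import torch
import torch.nn as nn


def same_padding(kernel_size):
    # Not shown in the original post; assumed here to return the padding that
    # preserves spatial size for an odd kernel with stride 1.
    return kernel_size // 2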

def convBlock(
    input_channels, output_channels, kernel_size=3, padding=None, activation="prelu"
):
    """
    Initializes a conv block using Kaiming Initialization
    """
    padding_par = 0
    if padding == "same":
        padding_par = same_padding(kernel_size)
    conv = nn.Conv2d(input_channels, output_channels, kernel_size, padding=padding_par)
    relu_negative_slope = 0.25
    act = None
    if activation == "prelu" or activation == "leaky_relu":
        nn.init.kaiming_normal_(conv.weight, a=relu_negative_slope, mode="fan_in")
        if activation == "prelu":
            act = nn.PReLU(init=relu_negative_slope)
        else:
            act = nn.LeakyReLU(negative_slope=relu_negative_slope)
    elif activation == "relu":
        nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
        act = nn.ReLU()
    # Biases start at zero, as prescribed by Kaiming initialization
    nn.init.zeros_(conv.bias)
    block = nn.Sequential(conv, act)
    return block


def flatten(lis):
    for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
            for x in flatten(item):
                yield x
        else:
            yield item


def Sequential(args):
    flattened_args = list(flatten(args))
    return nn.Sequential(*flattened_args)

Here is the test code

ls = []
for i in range(100):
    ls.append(convBlock(3, 3, 3, "same"))
model = Sequential(ls)

test = np.ones((1, 3, 5, 5))
model(torch.Tensor(test))

The output I get is

tensor([[[[-1.7771e-10, -3.5088e-10,  5.9369e-09,  4.2668e-09,  9.8803e-10],
          [ 1.8657e-09, -4.0271e-10,  3.1189e-09,  1.5117e-09,  6.6546e-09],
          [ 2.4237e-09, -6.2249e-10, -5.7327e-10,  4.2867e-09,  6.0034e-09],
          [-1.8757e-10,  5.5446e-09,  1.7641e-09,  5.7018e-09,  6.4347e-09],
          [ 1.2352e-09, -3.4732e-10,  4.1553e-10, -1.2996e-09,  3.8971e-09]],

         [[ 2.6607e-09,  1.7756e-09, -1.0923e-09, -1.4272e-09, -1.1840e-09],
          [ 2.0668e-10, -1.8130e-09, -2.3864e-09, -1.7061e-09, -1.7147e-10],
          [-6.7161e-10, -1.3440e-09, -6.3196e-10, -8.7677e-10, -1.4851e-09],
          [ 3.1475e-09, -1.6574e-09, -3.4180e-09, -3.5224e-09, -2.6642e-09],
          [-1.9703e-09, -3.2277e-09, -2.4733e-09, -2.3707e-09, -8.7598e-10]],

         [[ 3.5573e-09,  7.8113e-09,  6.8232e-09,  1.2285e-09, -9.3973e-10],
          [ 6.6368e-09,  8.2877e-09,  9.2108e-10,  9.7531e-10,  7.0011e-10],
          [ 6.6954e-09,  9.1019e-09,  1.5128e-08,  3.3151e-09,  2.1899e-10],
          [ 1.2152e-08,  7.7002e-09,  1.6406e-08,  1.4948e-08, -6.0882e-10],
          [ 6.9930e-09,  7.3222e-09, -7.4308e-10,  5.2505e-09,  3.4365e-09]]]],
       grad_fn=<PreluBackward>)

Answer 1

Great question (and welcome to Stack Overflow)! Research paper for quick reference.

TLDR

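Try a wider network (64 channels)
Add batch normalization after the activation (or even before it; it shouldn't make much difference)
Add residual connections (these shouldn't improve much over batch norm; last resort)

Please try these in this order and leave a comment on which ones (if any) worked in your case, as I'm curious as well.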

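Things you do differently

Your neural network is very deep, yet very narrow (only 81 parameters per layer!)

Because of the above, one cannot reliably draw those weights from a normal distribution: the sample is simply too small.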
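As a rough way to put numbers on this point, the figures for the question's 3-channel, 3x3 layers can be worked out as below. This is a sketch, not part of the original answer: it assumes the standard deviation that `kaiming_normal_` targets with `mode="fan_in"` and negative slope a = 0.25, and it uses the usual large-sample approximation for the standard error of a sample standard deviation.

```latex
% Weights per conv layer (3 input channels, 3 output channels, 3x3 kernel)
N = C_{\text{out}} \cdot C_{\text{in}} \cdot k^2 = 3 \cdot 3 \cdot 3^2 = 81

% Target standard deviation for mode="fan_in" with negative slope a = 0.25
\text{fan\_in} = C_{\text{in}} \cdot k^2 = 27, \qquad
\sigma = \sqrt{\frac{2}{(1 + a^2)\,\text{fan\_in}}}
       = \sqrt{\frac{2}{1.0625 \cdot 27}} \approx 0.264

% Approximate relative standard error of the sample std from N normal draws:
% 3 channels (N = 81) versus 64 channels (N = 64 * 64 * 9 = 36864)
\frac{\operatorname{SE}(\hat\sigma)}{\sigma} \approx \frac{1}{\sqrt{2N}}
  = \frac{1}{\sqrt{162}} \approx 7.9\%
  \quad\text{vs.}\quad
  \frac{1}{\sqrt{2 \cdot 36864}} \approx 0.37\%
```

Under these assumptions, the realized spread of the weights in a 3-channel layer can plausibly miss its target by several percent, whereas a 64-channel layer stays within a fraction of a percent.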

Try a wider network, 64 channels or more.
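One way to try this suggestion is to build the same 100-block stack at two widths and compare both how closely the sampled weights match their target standard deviation and how large the final output is. The following is a minimal sketch, not code from the question or the answer: `make_stack`, `weight_std_error`, and `output_scale` are hypothetical helper names, and `padding=1` stands in for the question's `same_padding` helper, which is not shown.

```python
import math

import torch
import torch.nn as nn

SLOPE = 0.25  # negative slope used for both the init and PReLU, as in the question


def make_stack(channels, depth=100):
    """Stack `depth` conv + PReLU blocks with Kaiming-initialized weights."""
    layers = []
    for _ in range(depth):
        # padding=1 keeps the 5x5 spatial size for a 3x3 kernel (assumption)
        conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(conv.weight, a=SLOPE, mode="fan_in")
        nn.init.zeros_(conv.bias)
        layers += [conv, nn.PReLU(init=SLOPE)]
    return nn.Sequential(*layers)


def weight_std_error(model):
    """Largest relative gap between a conv layer's sample std and its target std."""
    worst = 0.0
    for layer in model:
        if isinstance(layer, nn.Conv2d):
            fan_in = layer.in_channels * layer.kernel_size[0] * layer.kernel_size[1]
            target = math.sqrt(2.0 / ((1.0 + SLOPE ** 2) * fan_in))
            gap = abs(layer.weight.std().item() - target) / target
            worst = max(worst, gap)
    return worst


def output_scale(model, channels):
    """Mean absolute output for an all-ones 5x5 input, as in the question."""
    with torch.no_grad():
        return model(torch.ones(1, channels, 5, 5)).abs().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    for channels in (3, 64):
        model = make_stack(channels)
        print(channels, weight_std_error(model), output_scale(model, channels))
```

The printed numbers depend on the seed, so it is worth repeating the run with a few different seeds before concluding whether width alone is enough in a given setup.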

You are trying a much deeper network than they did

Section: Comparison Experiments

We conducted comparisons on a deep but efficient model with 14 weight layers (actually 22 was also tested in comparison with Xavier)

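That is due to the paper's release date (2015) and the hardware limitations "back in the day" (let's say).

Is this normal?

The approach itself is quite unusual for a network of this depth, at least currently:

each conv block is usually followed by an activation like ReLU and by batch normalization (which normalizes the signal and helps with exploding/vanishing signals)
networks of this depth (even ones half as deep as yours) usually use residual connections as well (though this is not directly linked to vanishing/small signals; it has more to do with the degradation problem of even deeper networks, like 1000 layers)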
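The two points above can be sketched as a single block: conv, PReLU, then batch normalization, with an optional identity skip connection around the whole thing. This is an illustrative sketch under stated assumptions, not the answerer's code: `NormResBlock` and `build` are hypothetical names, the batch norm is placed after the activation as suggested in the TLDR, and `padding=1` is assumed to keep the spatial size for a 3x3 kernel.

```python
import torch
import torch.nn as nn


class NormResBlock(nn.Module):
    """conv -> PReLU -> BatchNorm, with an optional identity skip connection."""

    def __init__(self, channels, slope=0.25, residual=True):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv.weight, a=slope, mode="fan_in")
        nn.init.zeros_(self.conv.bias)
        self.act = nn.PReLU(init=slope)
        self.norm = nn.BatchNorm2d(channels)  # batch norm after the activation
        self.residual = residual

    def forward(self, x):
        out = self.norm(self.act(self.conv(x)))
        if self.residual:
            return x + out  # skip connection: add the block's input back
        return out


def build(channels=64, depth=100, residual=True):
    """Stack `depth` blocks of the same width."""
    blocks = [NormResBlock(channels, residual=residual) for _ in range(depth)]
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(8, 64, 5, 5)
    for residual in (False, True):
        model = build(residual=residual)  # training mode: BN uses batch statistics
        with torch.no_grad():
            print(residual, model(x).abs().mean().item())
```

Running the stack with `residual=False` first and `residual=True` second mirrors the order suggested in the TLDR, so each change can be judged on its own. Note that batch normalization needs more than one value per channel in training mode, which is why the usage example feeds a small batch rather than a single image.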


