Derivative of BatchNorm2d in PyTorch
In my network I want to compute both the forward pass and the backward pass of my network myself during the forward pass. To do this, I have to manually define the backward-pass method for every forward-pass layer. For the activation functions this is easy, and it also works well for linear and convolution layers, but I am really struggling with BatchNorm, since the BatchNorm paper only discusses the 1D case.
So far, my implementation looks like this:
def backward_batchnorm2d(input, output, grad_output, layer):
    gamma = layer.weight
    beta = layer.bias
    avg = layer.running_mean
    var = layer.running_var
    eps = layer.eps
    B = input.shape[0]
    # avg, var, gamma and beta are of shape [channel_size]
    # while input, output, grad_output are of shape [batch_size, channel_size, w, h]
    # for my calculations I have to reshape avg, var, gamma and beta to [batch_size, channel_size, w, h]
    # by repeating the channel values over the whole image and batches
    dL_dxi_hat = grad_output * gamma
    dL_dvar = (-0.5 * dL_dxi_hat * (input - avg) / ((var + eps) ** 1.5)).sum((0, 2, 3), keepdim=True)
    dL_davg = (-1.0 / torch.sqrt(var + eps) * dL_dxi_hat).sum((0, 2, 3), keepdim=True) + dL_dvar * (-2.0 * (input - avg)).sum((0, 2, 3), keepdim=True) / B
    dL_dxi = dL_dxi_hat / torch.sqrt(var + eps) + 2.0 * dL_dvar * (input - avg) / B + dL_davg / B # dL_dxi_hat / sqrt()
    dL_dgamma = (grad_output * output).sum((0, 2, 3), keepdim=True)
    dL_dbeta = (grad_output).sum((0, 2, 3), keepdim=True)
    return dL_dxi, dL_dgamma, dL_dbeta
When I check my gradients against torch.autograd.grad(), I notice that dL_dgamma and dL_dbeta are correct, but dL_dxi is wrong (by a lot). I can't find my mistake though. Where is my error?
For reference, here is the definition of BatchNorm:
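For a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the BatchNorm transform is defined as

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta.$$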
And here are the formulas for the derivatives of the 1D case:
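For a mini-batch of size $m$ and loss $\ell$, the paper gives:

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\cdot\gamma$$

$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\left(x_i - \mu_{\mathcal{B}}\right)\cdot\frac{-1}{2}\left(\sigma_{\mathcal{B}}^2 + \epsilon\right)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2}\cdot\frac{\sum_{i=1}^{m}-2\left(x_i - \mu_{\mathcal{B}}\right)}{m}$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2}\cdot\frac{2\left(x_i - \mu_{\mathcal{B}}\right)}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}}\cdot\frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}\cdot\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}$$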
def backward_batchnorm2d(input, output, grad_output, layer):
    gamma = layer.weight
    gamma = gamma.view(1, -1, 1, 1) # edit
    # beta = layer.bias
    # avg = layer.running_mean
    # var = layer.running_var
    eps = layer.eps
    B = input.shape[0] * input.shape[2] * input.shape[3] # edit
    # add new
    mean = input.mean(dim=(0, 2, 3), keepdim=True)
    variance = input.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (input - mean) / (torch.sqrt(variance + eps))
    dL_dxi_hat = grad_output * gamma
    # dL_dvar = (-0.5 * dL_dxi_hat * (input - avg) / ((var + eps) ** 1.5)).sum((0, 2, 3), keepdim=True)
    # dL_davg = (-1.0 / torch.sqrt(var + eps) * dL_dxi_hat).sum((0, 2, 3), keepdim=True) + dL_dvar * (-2.0 * (input - avg)).sum((0, 2, 3), keepdim=True) / B
    dL_dvar = (-0.5 * dL_dxi_hat * (input - mean)).sum((0, 2, 3), keepdim=True) * ((variance + eps) ** -1.5) # edit
    dL_davg = (-1.0 / torch.sqrt(variance + eps) * dL_dxi_hat).sum((0, 2, 3), keepdim=True) + (dL_dvar * (-2.0 * (input - mean)).sum((0, 2, 3), keepdim=True) / B) # edit
    dL_dxi = (dL_dxi_hat / torch.sqrt(variance + eps)) + (2.0 * dL_dvar * (input - mean) / B) + (dL_davg / B) # dL_dxi_hat / sqrt()
    # dL_dgamma = (grad_output * output).sum((0, 2, 3), keepdim=True)
    dL_dgamma = (grad_output * x_hat).sum((0, 2, 3), keepdim=True) # edit
    dL_dbeta = (grad_output).sum((0, 2, 3), keepdim=True)
    return dL_dxi, dL_dgamma, dL_dbeta
- Since you didn't upload your forward code: if your gamma has a 1-dimensional shape (i.e. [channel_size]), you need to reshape it to [1, gamma.shape[0], 1, 1] so that it broadcasts correctly over [batch_size, channel_size, w, h].
- Your formulas follow the 1D case, where the normalization factor is the batch size. In 2D, however, the sums run over three dimensions, so B = input.shape[0] * input.shape[2] * input.shape[3] (e.g. for an input of shape [8, 64, 32, 32], B = 8 * 32 * 32 = 8192).
- running_mean and running_var are only used in test/inference mode; they are not used during training (you can find this in the paper). The mean and variance you need are computed from the input. You can either store mean, variance and x_hat = (x - mean) / sqrt(variance + eps) on your layer object during the forward pass, or recompute them as done in the code above under # add new, and then substitute them into the formulas for dL_dvar, dL_davg and dL_dxi.
- Your dL_dgamma should actually be incorrect as well, because you multiply the incoming gradient by output itself; it should be changed to grad_output * x_hat.
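As a sanity check, the corrected backward_batchnorm2d can be compared against torch.autograd.grad. Below is a minimal sketch; the layer size, input shape and tolerance are illustrative assumptions rather than part of the original post:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.BatchNorm2d(3)                             # training mode, so batch statistics are used
input = torch.randn(4, 3, 5, 5, requires_grad=True)
output = layer(input)
grad_output = torch.randn_like(output)                # arbitrary upstream gradient

# reference gradients from autograd
dx_ref, dgamma_ref, dbeta_ref = torch.autograd.grad(
    output, (input, layer.weight, layer.bias), grad_outputs=grad_output)

# manual gradients from the function above
dx, dgamma, dbeta = backward_batchnorm2d(input, output, grad_output, layer)

# all three comparisons should print True for a correct implementation
print(torch.allclose(dx, dx_ref, atol=1e-5))
print(torch.allclose(dgamma.flatten(), dgamma_ref, atol=1e-5))
print(torch.allclose(dbeta.flatten(), dbeta_ref, atol=1e-5))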