Why does the learning rate change when using torch.optim.SGD?

With SGD, the learning rate should not change across epochs, but it does here. Please help me understand why this happens and how I can prevent this LR change.

import torch
params = [torch.nn.Parameter(torch.randn(1, 1))]
optimizer = torch.optim.SGD(params, lr=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
for epoch in range(5):
    print(scheduler.get_lr())
    scheduler.step()

The output is:

[0.9]
[0.7290000000000001]
[0.6561000000000001]
[0.5904900000000002]
[0.5314410000000002]

My torch version is 1.4.0.

Since you are using the command torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9) (which actually means torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)), you are multiplying the learning rate by gamma=0.9 every step_size=1 step (a quick arithmetic check follows the list below):

  • 0.9 = 0.9
  • 0.729 = 0.9*0.9*0.9
  • 0.6561 = 0.9*0.9*0.9*0.9
  • 0.59049 = 0.9*0.9*0.9*0.9*0.9
  • 0.531441 = 0.9*0.9*0.9*0.9*0.9*0.9
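
A quick arithmetic check of the list above; every printed value is lr * gamma**k for some integer k, and 0.81 is the one power that never appears in the question's output:

lr, gamma = 0.9, 0.9
# round() is only for readability; float arithmetic explains the trailing ...0000002 digits
print([round(lr * gamma ** k, 6) for k in range(6)])
# -> [0.9, 0.81, 0.729, 0.6561, 0.59049, 0.531441]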

The only "strange" point is that 0.81 = 0.9*0.9 is missing at the second step (update: see the explanation in the second answer below).

To prevent the decay from happening too early: if the dataset has N samples and the batch size is D, set torch.optim.lr_scheduler.StepLR(optimizer, step_size=N/D, gamma=0.9) to decay once per epoch. To decay every E epochs instead, set torch.optim.lr_scheduler.StepLR(optimizer, step_size=E*N/D, gamma=0.9). A sketch of this setup follows.
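A minimal sketch of that setup, assuming hypothetical values N=1000 and D=100 and that scheduler.step() is called once per batch:

import torch

N, D, E = 1000, 100, 2    # hypothetical dataset size, batch size, epoch interval
steps_per_epoch = N // D  # integer division, assuming N is a multiple of D

params = [torch.nn.Parameter(torch.randn(1, 1))]
optimizer = torch.optim.SGD(params, lr=0.9)

# decay once per epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=steps_per_epoch, gamma=0.9)
# or, to decay every E epochs instead:
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=E * steps_per_epoch, gamma=0.9)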

This is exactly what torch.optim.lr_scheduler.StepLR is supposed to do: it changes the learning rate. From the PyTorch documentation:

Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr

If you are actually trying to optimize params, your code should look more like this (just a toy example; the exact form of loss will depend on your application):

for epoch in range(5):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    loss = (params[0] ** 2).sum()  # toy loss; yours will differ
    loss.backward()                # compute gradients
    optimizer.step()               # update params using the current learning rate

Expanding on the answer above about the "strange" behavior (0.81 is missing): this has been PyTorch's default behavior since the 1.1.0 release; check the documentation, specifically this part:

[...] If you use the learning rate scheduler (calling scheduler.step()) before the optimizer’s update (calling optimizer.step()), this will skip the first value of the learning rate schedule.

Moreover, you should get a UserWarning thrown by this function after the first get_lr() call, because you are not calling optimizer.step() at all. A sketch of the corrected loop follows.
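
A minimal sketch of the fixed loop, assuming PyTorch >= 1.4 (where scheduler.get_last_lr() reports the rate actually in use, without the UserWarning); stepping the optimizer before the scheduler means no value of the schedule is skipped:

import torch

params = [torch.nn.Parameter(torch.randn(1, 1))]
optimizer = torch.optim.SGD(params, lr=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

for epoch in range(5):
    print(scheduler.get_last_lr())  # current lr, no UserWarning
    optimizer.zero_grad()
    loss = (params[0] ** 2).sum()   # same toy loss as above
    loss.backward()
    optimizer.step()                # optimizer first ...
    scheduler.step()                # ... then the scheduler

# prints approximately [0.9], [0.81], [0.729], [0.6561], [0.59049] -- 0.81 is no longer skipped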