Accumulating Gradients
I want to accumulate gradients before doing the backward pass, so I'm wondering what the correct way of doing it is. According to this article it is:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()
whereas what I expected was:
model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Compute loss function
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()
        loss = 0
Here I accumulate the loss and then divide by the number of accumulation steps to average it.
Second question: if I'm right, would you expect my approach to be faster, given that I only do the backward pass once per accumulation step?
The backward pass, loss.backward(), is the operation that actually computes the gradients.
If you only do the forward pass (predictions = model(inputs)), no gradients are computed, so there is nothing to accumulate.
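A minimal sketch of that point (the single toy parameter w below is purely hypothetical): .grad is empty after the forward pass and is only filled in, and added to, by backward():

import torch

w = torch.ones(1, requires_grad=True)   # hypothetical single learnable parameter

out = (2 * w).sum()                      # forward pass only
print(w.grad)                            # None: no gradient has been computed yet

out.backward()                           # backward pass computes d(out)/dw = 2
print(w.grad)                            # tensor([2.])

out2 = (3 * w).sum()
out2.backward()                          # a second backward accumulates into .grad
print(w.grad)                            # tensor([5.]) = 2 + 3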
So, as the answer here explains, the first approach is the memory-efficient one. The amount of work required is roughly the same for both approaches.
The second approach keeps accumulating the computation graph, so it needs accumulation_steps times as much memory. The first approach computes the gradients right away (and simply adds them up), so it requires less memory.
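A minimal sketch to check this (the tiny linear model and random data below are made up for illustration): both approaches arrive at the same accumulated gradients before optimizer.step(); the difference is that the second one keeps every mini-batch's graph alive until the single backward() call:

import torch
import torch.nn as nn

torch.manual_seed(0)
accumulation_steps = 4
inputs = [torch.randn(8, 3) for _ in range(accumulation_steps)]
labels = [torch.randn(8, 1) for _ in range(accumulation_steps)]
loss_function = nn.MSELoss()

# Approach 1: backward() per mini-batch, gradients add up in .grad
model = nn.Linear(3, 1)
model.zero_grad()
for x, y in zip(inputs, labels):
    loss = loss_function(model(x), y) / accumulation_steps
    loss.backward()                               # this batch's graph is freed here
grad_a = model.weight.grad.clone()

# Approach 2: sum the losses, single backward() at the end
model2 = nn.Linear(3, 1)
model2.load_state_dict(model.state_dict())        # same weights for a fair comparison
model2.zero_grad()
total = 0
for x, y in zip(inputs, labels):
    total = total + loss_function(model2(x), y)   # all graphs stay alive until backward()
(total / accumulation_steps).backward()
grad_b = model2.weight.grad.clone()

print(torch.allclose(grad_a, grad_b))             # True: same gradients, different memory profile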