Model takes twice the memory footprint with distributed data parallel
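I have a model that trains fine on a single GPU. But when I switch to PyTorch DistributedDataParallel (DDP), I run into CUDA out-of-memory errors. Specifically, the DDP model takes up twice the memory of the same model without parallelism. Here is a minimal reproducible example: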
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]
    model = torch.nn.Linear(1000, 1000)
    # Nothing is on the GPU yet
    print(device_id, torch.cuda.memory_allocated(device_id))
    model.to(device_id)
    # Parameters now live on the GPU
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    # Memory after the (optional) DDP wrap
    print(device_id, torch.cuda.memory_allocated(device_id))


def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)


if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)
    # Then test multiple GPUs
    train_distributed()
Output (note that after switching to DDP, the allocated GPU memory doubles on both devices):
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
Why does the model take up twice the space under DDP? Is this intended behavior? Is there any way to avoid the extra memory usage?
I am adding here the solution that @ptrblck wrote on the PyTorch forums. Two quotes:
[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel
and the answer:
[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant
So the doubling comes from the gradient buckets that the Reducer allocates when the model is wrapped. How much it matters depends on how large the parameters are relative to the activations.
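The numbers in the question's output can be checked by hand. The sketch below is plain arithmetic, not a measurement. It assumes float32 parameters, assumes the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes, and assumes the gradients of this small model fit into a single flat bucket. The helper names are made up for illustration.

```python
# Back-of-the-envelope check of the numbers printed in the question.
# Assumptions (not stated in the original post): float32 parameters,
# allocations rounded up to 512-byte multiples, and one flat gradient
# bucket that holds the gradients of all parameters.
BLOCK = 512
BYTES_PER_FLOAT32 = 4


def round_up(nbytes, block=BLOCK):
    """Round an allocation size up to the next multiple of `block`."""
    return (nbytes + block - 1) // block * block


def linear_param_bytes(in_features, out_features):
    """Raw byte sizes of the weight and bias of a Linear layer."""
    weight = in_features * out_features * BYTES_PER_FLOAT32
    bias = out_features * BYTES_PER_FLOAT32
    return weight, bias


weight, bias = linear_param_bytes(1000, 1000)

# Weight and bias are two separate allocations.
params_allocated = round_up(weight) + round_up(bias)

# The bucket is modeled as one allocation the size of all gradients.
bucket_allocated = round_up(weight + bias)

print(params_allocated)                     # 4004352
print(params_allocated + bucket_allocated)  # 8008704
```

Under these assumptions the first value matches the memory reported after model.to(device_id), and the second matches the memory reported after the DDP wrap.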
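Try using gradient_as_bucket_view to save memory. As the documentation says:
gradient_as_bucket_view (bool) – When set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Moreover, it avoids the overhead of copying between gradients and allreduce communication buckets. When gradients are views, detach_() cannot be called on the gradients. If hitting such errors, please fix it by referring to the zero_grad() function in torch/optim/optimizer.py as a solution.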
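A minimal sketch of passing the flag might look like the following. To keep it runnable without GPUs, it assumes a single CPU process with the gloo backend, so it only shows the wiring and does not measure any memory saving. The port number, the run_once helper, and the SGD optimizer are choices made for this illustration. In the question's script, the same keyword argument would be added to the existing DDP(...) call.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def run_once():
    """Wrap a small model in DDP with gradient_as_bucket_view=True,
    run one training step, and report whether all gradients were set."""
    # Single-process group on CPU; address and port are arbitrary choices.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29512")
    dist.init_process_group("gloo", rank=0, world_size=1)
    try:
        model = torch.nn.Linear(1000, 1000)
        # Gradients become views into the allreduce buckets
        # instead of separate tensors.
        ddp_model = DDP(model, gradient_as_bucket_view=True)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        loss = ddp_model(torch.randn(8, 1000)).sum()
        loss.backward()
        grads_ready = all(p.grad is not None for p in ddp_model.parameters())

        optimizer.step()
        # Drop the gradients rather than calling detach_() on views.
        optimizer.zero_grad(set_to_none=True)
        return grads_ready
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_once())
```

Because the gradients share storage with the buckets, code that calls detach_() on a gradient will fail. Resetting gradients through the optimizer's zero_grad() avoids that.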