Model takes twice the memory footprint with distributed data parallel

I have a model that trains fine on a single GPU. But when I switch to PyTorch DistributedDataParallel (DDP), I run into a CUDA out-of-memory error. Specifically, the DDP model takes up twice as much memory as the model without parallelism. Here is a minimal reproducible example:

import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch

def train(rank, gpu_list, train_distributed):
    
    device_id = gpu_list[rank]

    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))  # before moving the model to the GPU
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))  # after moving the model to the GPU

    print(device_id, torch.cuda.memory_allocated(device_id))  # before wrapping in DDP
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    print(device_id, torch.cuda.memory_allocated(device_id))  # after wrapping in DDP

def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)

if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)

    # Then test multiple GPUs
    train_distributed()

Output - note that the GPU memory usage on both devices doubles when switching to DDP:

cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704

Why does the model take up twice the space with DDP? Is this intended behavior? Is there a way to avoid this extra memory usage?

I'm adding here the solution written by @ptrblck on the PyTorch forums.

Quoting two passages from that thread.

The statement:

[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel

The answer:

[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant

So, from this we can see why the memory footprint sometimes doubles.
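
To see where the numbers in the output come from, here is a minimal sketch (run on the CPU, no DDP required) that compares the parameter size of the example model with the allocations reported above:

import torch

model = torch.nn.Linear(1000, 1000)
# 1000*1000 weights + 1000 biases, 4 bytes each in float32
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(n_bytes)      # 4004000 bytes, matching the ~4004352 reported by memory_allocated (after allocator rounding)
print(2 * n_bytes)  # ~8008704 bytes, the footprint once DDP's Reducer adds its gradient buckets

So the 8008704 bytes in the DDP output are exactly the model parameters plus one extra parameter-sized set of gradient buckets.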

Try gradient_as_bucket_view to save memory. As the documentation says,

gradient_as_bucket_view (bool) – When set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradient size. Moreover, it avoids the overhead of copying between gradients and allreduce communication buckets. When gradients are views, detach_() cannot be called on the gradients. If hitting such errors, please fix it by referring to the zero_grad() function in torch/optim/optimizer.py as a solution.
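
Applied to the minimal example above, this is just one extra keyword argument when wrapping the model (a sketch reusing model and device_id from the question; I have not benchmarked the saving myself):

from torch.nn.parallel import DistributedDataParallel as DDP

# With gradient_as_bucket_view=True, the gradients become views into the
# allreduce buckets instead of separate copies, so per the docs the peak
# memory should drop by roughly the total gradient size.
model = DDP(
    model,
    device_ids=[device_id],
    find_unused_parameters=False,
    gradient_as_bucket_view=True,
)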