Model takes twice the memory footprint with distributed data parallel

I have a model that trains fine on a single GPU. But when I switch to PyTorch DistributedDataParallel (DDP), I run into a CUDA out-of-memory error. Specifically, the DDP model takes up twice as much memory as the model without parallelism. Here is a minimal reproducible example:

import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch

def train(rank, gpu_list, train_distributed):
    
    device_id = gpu_list[rank]

    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))  # before moving the model to the GPU
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))  # after moving the model to the GPU

    print(device_id, torch.cuda.memory_allocated(device_id))  # before wrapping in DDP
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    print(device_id, torch.cuda.memory_allocated(device_id))  # after wrapping in DDP

def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)

if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)

    # Then test multiple GPUs
    train_distributed()

Output - note that the GPU memory usage on both devices doubles when switching to DDP:

cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704

Why does the model take up twice the space with DDP? Is this intended behavior? Is there a way to avoid this extra memory usage?

I'm adding here the solution written by @ptrblck on the PyTorch forums.

Quoting two passages from that thread.

The statement:

[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel

The answer:

[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant

So, from this we can see why the memory footprint sometimes doubles.
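
To see where the numbers in the output come from, here is a minimal sketch (run on the CPU, no DDP required) that compares the parameter size of the example model with the allocations reported above:

import torch

model = torch.nn.Linear(1000, 1000)
# 1000*1000 weights + 1000 biases, 4 bytes each in float32
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(n_bytes)      # 4004000 bytes, matching the ~4004352 reported by memory_allocated (after allocator rounding)
print(2 * n_bytes)  # ~8008704 bytes, the footprint once DDP's Reducer adds its gradient buckets

So the 8008704 bytes in the DDP output are exactly the model parameters plus one extra parameter-sized set of gradient buckets.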

Try gradient_as_bucket_view to save memory. As the documentation says,

gradient_as_bucket_view (bool) – When set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradient size. Moreover, it avoids the overhead of copying between gradients and allreduce communication buckets. When gradients are views, detach_() cannot be called on the gradients. If hitting such errors, please fix it by referring to the zero_grad() function in torch/optim/optimizer.py as a solution.
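
Applied to the minimal example above, this is just one extra keyword argument when wrapping the model (a sketch reusing model and device_id from the question; I have not benchmarked the saving myself):

from torch.nn.parallel import DistributedDataParallel as DDP

# With gradient_as_bucket_view=True, the gradients become views into the
# allreduce buckets instead of separate copies, so per the docs the peak
# memory should drop by roughly the total gradient size.
model = DDP(
    model,
    device_ids=[device_id],
    find_unused_parameters=False,
    gradient_as_bucket_view=True,
)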