What does local rank mean in distributed deep learning?
https://github.com/huggingface/transformers/blob/master/examples/run_glue.py
I want to adapt this script to do text classification on my own data. The machine used for this task is a single computer with two graphics cards. So this involves a kind of "distributed" training via the term local_rank in the script above, especially where local_rank
equals 0 or -1, as on line 83.
After reading some material on distributed computing, my guess is that local_rank
is something like an ID for a machine, and that 0 might mean this machine is the "main" or "head" of the computation. But what is -1?
Q: But what is -1?
It is generally used to switch off the distributed setting. Indeed, as you can see here:
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
and here:
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)
setting local_rank
to -1
has this effect.
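For context, here is a minimal sketch (not taken from run_glue.py, just an illustration with a toy dataset) of how such scripts typically receive local_rank from the launcher and branch on -1:
import argparse
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every worker process;
# the default of -1 means "no distributed training at all".
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Single-process mode: default CUDA device (or CPU), plain random sampling.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
    # Distributed mode: one process per GPU, each pinned to its own local_rank.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")

# Toy dataset just to show how the sampler choice follows local_rank.
train_dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))
train_sampler = (
    RandomSampler(train_dataset)
    if args.local_rank == -1
    else DistributedSampler(train_dataset)
)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=2)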
I'd like to add a bit more to @Berriel's answer. Since you have two GPUs rather than a distributed setup with a node structure, you don't need distributed methods like DistributedSampler. Hugging Face uses -1 to disable the distributed settings in the training mechanisms.
Check out the following code from the Hugging Face training_args.py script. You can see how self.local_rank is changed depending on whether a distributed training mechanism is present.
def _setup_devices(self) -> "torch.device":
    logger.info("PyTorch: setting up devices")
    if self.no_cuda:
        device = torch.device("cpu")
        self._n_gpu = 0
    elif is_torch_tpu_available():
        device = xm.xla_device()
        self._n_gpu = 0
    elif is_sagemaker_distributed_available():
        import smdistributed.dataparallel.torch.distributed as dist

        dist.init_process_group()
        self.local_rank = dist.get_local_rank()
        device = torch.device("cuda", self.local_rank)
        self._n_gpu = 1
    elif self.local_rank == -1:
        # if n_gpu is > 1 we'll use nn.DataParallel.
        # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`
        # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will
        # trigger an error that a device index is missing. Index 0 takes into account the
        # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0`
        # will use the first GPU in that env, i.e. GPU#1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        # Sometimes the line in the postinit has not been run before we end up here, so just checking we're not at
        # the default value.
        self._n_gpu = torch.cuda.device_count()
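To connect this back to the two-GPU case: when local_rank stays at -1 and torch.cuda.device_count() returns 2, the non-distributed fallback is plain data parallelism in a single process. A minimal sketch of that fallback (placeholder model, not Hugging Face's Trainer code):
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

# Placeholder model standing in for the Transformers model.
model = nn.Linear(4, 2).to(device)

# With local_rank == -1 and more than one GPU, the single process simply
# replicates the model across GPUs with nn.DataParallel instead of
# DistributedDataParallel, so no process group or DistributedSampler is needed.
if n_gpu > 1:
    model = nn.DataParallel(model)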