What does local rank mean in distributed deep learning?
https://github.com/huggingface/transformers/blob/master/examples/run_glue.py
I want to adapt this script to do text classification on my own data. The machine used for this task is a single computer with two graphics cards. So this involves a kind of "distributed" training via the term local_rank in the script above, especially where local_rank
equals 0 or -1, as on line 83.
After reading some material on distributed computing, my guess is that local_rank
is something like an ID for a machine, and that 0 might mean this machine is the "main" or "head" of the computation. But what is -1?
Q: But what is -1?
It is generally used to switch off the distributed setting. Indeed, as you can see here:
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
and here:
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)
setting local_rank
to -1
has this effect.
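For context, here is a minimal sketch (not taken from run_glue.py, just an illustration with a toy dataset) of how such scripts typically receive local_rank from the launcher and branch on -1:
import argparse
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every worker process;
# the default of -1 means "no distributed training at all".
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Single-process mode: default CUDA device (or CPU), plain random sampling.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
    # Distributed mode: one process per GPU, each pinned to its own local_rank.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")

# Toy dataset just to show how the sampler choice follows local_rank.
train_dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))
train_sampler = (
    RandomSampler(train_dataset)
    if args.local_rank == -1
    else DistributedSampler(train_dataset)
)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=2)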
I'd like to add a bit more to @Berriel's answer. Since you have two GPUs rather than a distributed setup with a node structure, you don't need distributed methods like DistributedSampler. Hugging Face uses -1 to disable the distributed settings in the training mechanisms.
Check out the following code from the Hugging Face training_args.py script. You can see how self.local_rank is changed depending on whether a distributed training mechanism is present.
def _setup_devices(self) -> "torch.device":
    logger.info("PyTorch: setting up devices")
    if self.no_cuda:
        device = torch.device("cpu")
        self._n_gpu = 0
    elif is_torch_tpu_available():
        device = xm.xla_device()
        self._n_gpu = 0
    elif is_sagemaker_distributed_available():
        import smdistributed.dataparallel.torch.distributed as dist

        dist.init_process_group()
        self.local_rank = dist.get_local_rank()
        device = torch.device("cuda", self.local_rank)
        self._n_gpu = 1
    elif self.local_rank == -1:
        # if n_gpu is > 1 we'll use nn.DataParallel.
        # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`
        # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will
        # trigger an error that a device index is missing. Index 0 takes into account the
        # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0`
        # will use the first GPU in that env, i.e. GPU#1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        # Sometimes the line in the postinit has not been run before we end up here, so just checking we're not at
        # the default value.
        self._n_gpu = torch.cuda.device_count()
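To connect this back to the two-GPU case: when local_rank stays at -1 and torch.cuda.device_count() returns 2, the non-distributed fallback is plain data parallelism in a single process. A minimal sketch of that fallback (placeholder model, not Hugging Face's Trainer code):
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

# Placeholder model standing in for the Transformers model.
model = nn.Linear(4, 2).to(device)

# With local_rank == -1 and more than one GPU, the single process simply
# replicates the model across GPUs with nn.DataParallel instead of
# DistributedDataParallel, so no process group or DistributedSampler is needed.
if n_gpu > 1:
    model = nn.DataParallel(model)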