Understanding GPU usage in HuggingFace classification

I am building a classifier with HuggingFace and want to understand the line Total train batch size (w. parallel, distributed & accumulation) = 64 in the training log below:

  Num examples = 7000
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 16
  Total optimization steps = 327

I have 7000 rows of data, and I have set the number of epochs to 3, per_device_train_batch_size = 4, and per_device_eval_batch_size = 16. I also understand that Total optimization steps = 327, which is roughly 7000 * 3 / 64.

But I don't understand Total train batch size (w. parallel, distributed & accumulation) = 64. Since 16 * 4 (with Instantaneous batch size per device = 4) equals 64, does that mean there are 16 devices?

Well, the variables used to print that summary are here: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211

The total train batch size is defined as train_batch_size * gradient_accumulation_steps * world_size, so in your case 4 * 16 * 1 = 64. world_size is always 1 unless you are using TPUs or training in parallel (distributed training), see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127
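The arithmetic can be sketched as below. This is a simplified reproduction of the formulas, not the Trainer's exact code path; the variable names are illustrative, and the step count assumes incomplete trailing accumulation batches are dropped, which reproduces the 327 in your log:

```python
num_examples = 7000   # rows of training data
num_epochs = 3
per_device_train_batch_size = 4   # "Instantaneous batch size per device"
gradient_accumulation_steps = 16
world_size = 1        # number of devices; 1 on a single GPU

# Total train batch size: examples consumed per optimizer update,
# across all devices and accumulation steps.
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(total_train_batch_size)  # 4 * 16 * 1 = 64

# Optimizer updates per epoch (floored), times the number of epochs.
steps_per_epoch = num_examples // total_train_batch_size  # 7000 // 64 = 109
total_optimization_steps = steps_per_epoch * num_epochs
print(total_optimization_steps)  # 109 * 3 = 327
```

So the 16 in 16 * 4 = 64 is the gradient accumulation steps, not a device count: on one GPU, 16 micro-batches of 4 examples are accumulated before each optimizer update.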