Understanding GPU usage in HuggingFace classification

I am building a classifier with HuggingFace and want to understand the line Total train batch size (w. parallel, distributed & accumulation) = 64 in the training log below:

  Num examples = 7000
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 16
  Total optimization steps = 327

I have 7000 rows of data, and I have set the number of epochs to 3, per_device_train_batch_size = 4, and per_device_eval_batch_size = 16. I also understand that Total optimization steps = 327, which is roughly 7000 * 3 / 64.

But I don't understand Total train batch size (w. parallel, distributed & accumulation) = 64. Since 16 * 4 (with Instantaneous batch size per device = 4) equals 64, does that mean there are 16 devices?

Well, the variables used to print that summary are here: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211

The total train batch size is defined as train_batch_size * gradient_accumulation_steps * world_size, so in your case 4 * 16 * 1 = 64. world_size is always 1 unless you are using TPUs or training in parallel (distributed training), see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127
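The arithmetic can be sketched as below. This is a simplified reproduction of the formulas, not the Trainer's exact code path; the variable names are illustrative, and the step count assumes incomplete trailing accumulation batches are dropped, which reproduces the 327 in your log:

```python
num_examples = 7000   # rows of training data
num_epochs = 3
per_device_train_batch_size = 4   # "Instantaneous batch size per device"
gradient_accumulation_steps = 16
world_size = 1        # number of devices; 1 on a single GPU

# Total train batch size: examples consumed per optimizer update,
# across all devices and accumulation steps.
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(total_train_batch_size)  # 4 * 16 * 1 = 64

# Optimizer updates per epoch (floored), times the number of epochs.
steps_per_epoch = num_examples // total_train_batch_size  # 7000 // 64 = 109
total_optimization_steps = steps_per_epoch * num_epochs
print(total_optimization_steps)  # 109 * 3 = 327
```

So the 16 in 16 * 4 = 64 is the gradient accumulation steps, not a device count: on one GPU, 16 micro-batches of 4 examples are accumulated before each optimizer update.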