Understanding GPU usage with Hugging Face classification
I am building a classifier with Hugging Face and want to understand the line Total train batch size (w. parallel, distributed & accumulation) = 64 in the training summary below:

    Num examples = 7000
    Num Epochs = 3
    Instantaneous batch size per device = 4
    Total train batch size (w. parallel, distributed & accumulation) = 64
    Gradient Accumulation steps = 16
    Total optimization steps = 327

I have 7000 rows of data and have set the number of epochs to 3, per_device_train_batch_size = 4, and per_device_eval_batch_size = 16. I also understand where Total optimization steps = 327 comes from (roughly 7000 * 3 / 64). What I don't understand is Total train batch size (w. parallel, distributed & accumulation) = 64. Does 16 * 4 (with Instantaneous batch size per device = 4) equalling 64 mean that there are 16 devices?
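For reference, a minimal sketch of the TrainingArguments that would reproduce the values in that summary, assuming an otherwise standard Trainer setup; output_dir is a placeholder, everything else comes from the numbers quoted above:

    from transformers import TrainingArguments

    # Hypothetical configuration matching the logged summary; "classifier-output" is a placeholder.
    args = TrainingArguments(
        output_dir="classifier-output",
        num_train_epochs=3,                # Num Epochs = 3
        per_device_train_batch_size=4,     # Instantaneous batch size per device = 4
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=16,    # Gradient Accumulation steps = 16
    )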
Well, the variables used to print that summary are here: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211. The total train batch size is defined as train_batch_size * gradient_accumulation_steps * world_size, so in your case 4 * 16 * 1 = 64. world_size is always 1 unless you are using TPUs or training in parallel, see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127.
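To make the arithmetic concrete, here is a minimal sketch (not Trainer's actual code, just the same formulas) that reproduces both the effective batch size and the step count for this setup:

    # Values taken from the training summary above.
    num_examples = 7000
    num_epochs = 3
    per_device_train_batch_size = 4
    gradient_accumulation_steps = 16
    world_size = 1  # single device; > 1 only with distributed/TPU training

    # Effective (total) train batch size per optimizer update.
    total_train_batch_size = (
        per_device_train_batch_size * gradient_accumulation_steps * world_size
    )  # 4 * 16 * 1 = 64

    # Optimizer updates per epoch: dataloader batches // accumulation steps.
    batches_per_epoch = num_examples // per_device_train_batch_size      # 1750
    updates_per_epoch = batches_per_epoch // gradient_accumulation_steps  # 109
    total_optimization_steps = updates_per_epoch * num_epochs             # 327

    print(total_train_batch_size, total_optimization_steps)  # 64 327

This also suggests why 7000 * 3 / 64 ≈ 328 is slightly above the reported 327: the leftover batches in each epoch that do not fill a full accumulation window are not counted as an optimization step.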