Understanding GPU usage for huggingface classification - Total optimization steps
I am training a huggingface Longformer for a classification problem and got the output below.

I am confused about `Total optimization steps`. Since I have 7000 training data points, 5 epochs, and `Total train batch size (w. parallel, distributed & accumulation) = 64`, shouldn't I get 7000*5/64 = 546.875 steps? Why does it show `Total optimization steps = 545`?

Also, in the output below, why are there 16 lines of `Input ids are automatically padded from 1500 to 1536 to be a multiple of config.attention_window: 512` followed by `[ 23/545 14:24 < 5:58:16, 0.02 it/s, Epoch 0.20/5]`? What are these steps?
==========================================================
***** Running training *****
Num examples = 7000
Num Epochs = 5
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 16
Total optimization steps = 545
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
Initializing global attention on CLS token...
Input ids are automatically padded from 1500 to 1536 to be a multiple of `config.attention_window`: 512
[ 23/545 14:24 < 5:58:16, 0.02 it/s, Epoch 0.20/5]
Epoch Training Loss Validation Loss
#update
Added the `Trainer` and `TrainingArguments`:
# class weights
import torch
from torch import nn
from transformers import Trainer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # assumed defined earlier in the notebook

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (two labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 0.5243])).to(device)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
trainer = CustomTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_df_tuning_dataset_tokenized,
    eval_dataset=val_dataset_tokenized,
)
# define the training arguments
training_args = TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    greater_is_better=False,
    disable_tqdm=False,
    weight_decay=0.01,
    optim="adamw_torch",
    run_name="longformer-classification-16March2022",
)
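As an aside, the class-weighted loss used in `compute_loss` above can be sanity-checked in isolation. Below is a minimal sketch with made-up logits and labels; only the weight values [1.0, 0.5243] come from the snippet above, everything else is illustrative:

import torch
from torch import nn

# Toy batch: 3 samples, 2 classes; weights taken from the compute_loss override above.
logits = torch.tensor([[2.0, 0.1],
                       [0.2, 1.5],
                       [1.0, 1.0]])
labels = torch.tensor([0, 1, 1])

loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 0.5243]))
loss = loss_fct(logits.view(-1, 2), labels.view(-1))
print(loss)  # mistakes on class 1 contribute ~0.52x as much as mistakes on class 0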
1. Why are there 545 optimization steps?

Looking at the implementation of the `transformers` package, we can see that the `Trainer` uses a variable called `max_steps` when it prints the `Total optimization steps` message in its `train` method:
logger.info("***** Running training *****")
logger.info(f" Num examples = {num_examples}")
logger.info(f" Num Epochs = {num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {max_steps}")
Permalink to the above snippet in the transformers repo
Earlier in that same `train` method, the `Trainer` has the following code:
class Trainer:
    [...]
    def train(self) -> None:
        [Some irrelevant code omitted here...]

        total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size
        if train_dataset_is_sized:
            num_update_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps
            num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1)
            if args.max_steps > 0:
                max_steps = args.max_steps
                num_train_epochs = args.max_steps // num_update_steps_per_epoch + int(
                    args.max_steps % num_update_steps_per_epoch > 0
                )
                # May be slightly incorrect if the last batch in the training dataloader has a smaller size but it's
                # the best we can do.
                num_train_samples = args.max_steps * total_train_batch_size
            else:
                max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)
                num_train_epochs = math.ceil(args.num_train_epochs)
                num_train_samples = len(self.train_dataset) * args.num_train_epochs
Permalink to the above snippet in the transformers repo
As expected, `total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size` in your example equals `total_train_batch_size = 4 * 16 * 1 = 64`.

Then we have `num_update_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps`, which gives us `num_update_steps_per_epoch = len(train_dataloader) // 16`.
Now, the length of a `DataLoader` is the number of batches it contains. Since you have 7000 samples and `per_device_train_batch_size` is 4, this gives us `7000 / 4 = 1750` batches. Going back to `num_update_steps_per_epoch`, we now have `num_update_steps_per_epoch = 1750 // 16 = 109` (Python integer division).

You did not specify a maximum number of steps, so we reach `max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)`, which gives us `max_steps = math.ceil(5 * 109) = 545`.
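Putting those pieces together, here is a small sketch that reproduces the Trainer's arithmetic with the numbers from this question (plain Python mirroring the quoted snippet, not the Trainer's actual code path):

import math

num_samples = 7000
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 5
world_size = 1  # single GPU

# length of the DataLoader = number of batches per epoch
len_train_dataloader = math.ceil(num_samples / per_device_train_batch_size)       # 1750
num_update_steps_per_epoch = len_train_dataloader // gradient_accumulation_steps  # 1750 // 16 = 109
max_steps = math.ceil(num_train_epochs * num_update_steps_per_epoch)              # 545

total_train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size  # 64
print(total_train_batch_size, num_update_steps_per_epoch, max_steps)  # 64 109 545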
2. Why is the padding operation logged 16 times?

In the transformer architecture, you technically don't have to pad all samples to the same length. What really matters is that samples within a batch have the same length; the length may differ between batches.

This means the message will appear for every batch that goes through the forward pass. As for why the message appears only 16 times when 23 batches have actually gone through the forward pass, I can think of two possible reasons:
- the logging of the padding operation and the logging of the progress bar happen on two different threads, and the former lags slightly behind
- (highly unlikely) you have batches that don't need padding because all of their samples have the same length, and that length is already a multiple of 512
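For completeness, the 1500 → 1536 figure in the padding message follows from rounding the batch's sequence length up to the next multiple of `config.attention_window`. A short sketch of that arithmetic (variable names are mine, not the library's):

import math

seq_len = 1500
attention_window = 512

padded_len = math.ceil(seq_len / attention_window) * attention_window  # 3 * 512 = 1536
print(padded_len, padded_len - seq_len)  # 1536 tokens total, 36 of them padding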