View train error metrics for Hugging Face SageMaker model
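I have trained a model using Hugging Face's integration with Amazon SageMaker and their Hello World example.
I can easily compute and view the metrics generated on the evaluation (test) set, such as accuracy, F-score, precision and recall, by calling training_job_analytics on the trained model: huggingface_estimator.training_job_analytics.dataframe()
How can I see the same metrics on the training set (or even the training error for each epoch)?
The training code is basically the same as in the link, with the extra section from the docs added: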
from sagemaker.huggingface import HuggingFace
# optionally parse logs for key metrics
# from the docs: https://huggingface.co/docs/sagemaker/train#sagemaker-metrics
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': r"'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': r"'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': r"'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': r"'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': r"'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': r"'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': r"'epoch': ([0-9]+(.|e\-)[0-9]+),?"},
]
# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 5,
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}
# init the model (but not yet trained)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
# does not return metrics on training - only on eval!
huggingface_estimator.training_job_analytics.dataframe()
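As a quick local check of what these metric definitions would pick up, here is a minimal sketch that runs three of the regexes over two sample log lines using only Python's re module. The log lines are hypothetical and invented for illustration; the real output of train.py may be formatted differently. The helper name extract_metrics is also an assumption, not part of the SageMaker SDK.

```python
import re

# Same patterns as the 'loss', 'eval_loss' and 'epoch' entries in metric_definitions.
patterns = {
    "loss": r"'loss': ([0-9]+(.|e\-)[0-9]+),?",
    "eval_loss": r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?",
    "epoch": r"'epoch': ([0-9]+(.|e\-)[0-9]+),?",
}


def extract_metrics(log_line):
    """Return {metric name: captured value} for every pattern found in the line."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_line)
        if match:
            found[name] = float(match.group(1))
    return found


# Hypothetical log lines, invented for illustration only.
train_line = "{'loss': 0.4875, 'epoch': 1.0}"
eval_line = "{'eval_loss': 0.3125, 'eval_accuracy': 0.9, 'epoch': 1.0}"

train_metrics = extract_metrics(train_line)
eval_metrics = extract_metrics(eval_line)
print(train_metrics)
print(eval_metrics)
```

In this sketch the 'loss' pattern only fires on a line that contains the literal key 'loss' (with its opening quote), so an evaluation line contributes eval_loss but no training loss. If the job never prints a training-loss line, the loss metric will have no data points.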
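This can be solved by increasing the number of epochs in training to a more realistic value.
Currently the model trains for less than 300 seconds, which is when the next timestamp, and presumably the training loss, would be recorded.
The change to make: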
hyperparameters = {
    'epochs': 100,  # increase the number of epochs to a realistic value!
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}
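To confirm that the training loss shows up after the longer run, one option is to pull only the loss rows out of the analytics table. The sketch below works on plain dictionaries so that it runs without AWS access. It assumes the dataframe has the columns timestamp, metric_name and value, and that rows is obtained with something like huggingface_estimator.training_job_analytics.dataframe().to_dict("records"). The helper metric_series and the sample values are hypothetical.

```python
def metric_series(rows, metric_name):
    """Return (timestamp, value) pairs for one metric, ordered by timestamp."""
    selected = [
        (row["timestamp"], row["value"])
        for row in rows
        if row["metric_name"] == metric_name
    ]
    return sorted(selected)


# Hypothetical rows standing in for the records of the analytics dataframe
# (assumed columns: timestamp, metric_name, value).
rows = [
    {"timestamp": 0.0, "metric_name": "eval_loss", "value": 0.40},
    {"timestamp": 0.0, "metric_name": "loss", "value": 0.65},
    {"timestamp": 300.0, "metric_name": "loss", "value": 0.42},
]

train_loss = metric_series(rows, "loss")
print(train_loss)
```

An empty list for "loss" would suggest that no training-loss line was captured during the job.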