Hyperparameter tuning with ml-engine returns State: FAILED
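I'm trying to tune my model's hyperparameters with ml-engine, but I'm not sure whether it is working.
I have not specified the algorithm field in HyperparameterSpec, which according to the docs should default to the Bayesian optimization method. I have not set maxFailedTrials either, which according to the docs should end all trials if the first one fails.
Here is my config: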
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 8
    maxParallelTrials: 2
    hyperparameterMetricTag: test_accuracy
    params:
    - parameterName: dropout_rate
      type: DOUBLE
      minValue: 0.3
      maxValue: 0.7
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.0003
      scaleType: UNIT_LINEAR_SCALE
Here is the training output:
{
  "completedTrialCount": "8",
  "trials": [
    {
      "trialId": "1",
      "hyperparameters": {
        "lr": "0.00014959385395050048",
        "dropout_rate": "0.42217149734497067"
      },
      "startTime": "2019-10-07T09:40:02.143968039Z",
      "endTime": "2019-10-07T09:47:50Z",
      "state": "FAILED"
    },
    {
      "trialId": "2",
      "hyperparameters": {
        "dropout_rate": "0.62217149734497068",
        "lr": "0.00028292718728383382"
      },
      "startTime": "2019-10-07T09:40:02.144192681Z",
      "endTime": "2019-10-07T09:47:19Z",
      "state": "FAILED"
    },
    {
      "trialId": "3",
      "hyperparameters": {
        "lr": "0.00014846909046173097",
        "dropout_rate": "0.31717863082885739"
      },
      "startTime": "2019-10-07T09:48:09.266596472Z",
      "endTime": "2019-10-07T09:55:26Z",
      "state": "FAILED"
    },
    {
      "trialId": "4",
      "hyperparameters": {
        "lr": "0.00018741662502288819",
        "dropout_rate": "0.34178204536437984"
      },
      "startTime": "2019-10-07T09:48:10.761305330Z",
      "endTime": "2019-10-07T09:55:58Z",
      "state": "FAILED"
    },
    {
      "trialId": "5",
      "hyperparameters": {
        "dropout_rate": "0.6216828346252441",
        "lr": "0.00010192830562591553"
      },
      "startTime": "2019-10-07T09:56:15.904704865Z",
      "endTime": "2019-10-07T10:04:04Z",
      "state": "FAILED"
    },
    {
      "trialId": "6",
      "hyperparameters": {
        "dropout_rate": "0.42288427352905272",
        "lr": "0.000230206298828125"
      },
      "startTime": "2019-10-07T09:56:17.895067636Z",
      "endTime": "2019-10-07T10:04:05Z",
      "state": "FAILED"
    },
    {
      "trialId": "7",
      "hyperparameters": {
        "lr": "0.00019101441543291624",
        "dropout_rate": "0.36415641310447144"
      },
      "startTime": "2019-10-07T10:05:22.147233194Z",
      "endTime": "2019-10-07T10:13:09Z",
      "state": "FAILED"
    },
    {
      "trialId": "8",
      "hyperparameters": {
        "dropout_rate": "0.69955616224911532",
        "lr": "0.00029989311482522672"
      },
      "startTime": "2019-10-07T10:05:22.147396438Z",
      "endTime": "2019-10-07T10:13:30Z",
      "state": "FAILED"
    }
  ],
  "consumedMLUnits": 2.29,
  "isHyperparameterTuningJob": true,
  "hyperparameterMetricTag": "test_accuracy"
}
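All eight trials are run, so I think it is the search algorithm that fails for some reason. I can't find any more information about why it returns this state, nor any logs from the search algorithm, even when running with a different verbosity.
To me it looks like it cannot find the metric in the TensorFlow event files, but I don't understand why: the name is exactly the same, and I can open the event file with TensorBoard and see the data. Maybe there are some requirements on the log structure that I'm not aware of?
The code for logging the metric: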
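As a quick sanity check on the observation that every trial ran but ended as FAILED, the training output can be tallied by state. This is only a minimal sketch: it assumes the training output has been saved as JSON in the shape shown above, and `summarize_trials` and the abbreviated `sample` are hypothetical names and data introduced here for illustration, not part of ml-engine.

```python
import json
from collections import Counter


def summarize_trials(training_output):
    """Count trials per state and collect the IDs of the failed ones."""
    trials = training_output.get("trials", [])
    states = Counter(t.get("state", "UNKNOWN") for t in trials)
    failed_ids = [t["trialId"] for t in trials if t.get("state") == "FAILED"]
    return states, failed_ids


# Abbreviated sample with the same field names as the output above
# (only two of the eight trials, hyperparameters and timestamps omitted).
sample = json.loads("""
{
  "trials": [
    {"trialId": "1", "state": "FAILED"},
    {"trialId": "2", "state": "FAILED"}
  ],
  "isHyperparameterTuningJob": true,
  "hyperparameterMetricTag": "test_accuracy"
}
""")

states, failed_ids = summarize_trials(sample)
print(dict(states), failed_ids)
```

If every trial shows up under FAILED even though each one ran for several minutes, that is at least consistent with the trials training normally and the failure happening when the metric is collected.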
from tensorflow.contrib.summary import summary as summary_ops

# in __init__
self.tf_board_writer = summary_ops.create_file_writer(self.save_path)
....
# During training
with self.tf_board_writer.as_default(), summary_ops.always_record_summaries():
    summary_ops.scalar(name=name, tensor=value, step=step)
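If anyone from the ml-engine team ends up here: now that TF2 is stable and released, do you know when it will be available in the runtime environment?
Anyway, I hope someone can help me :)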
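The problem can be solved by using the Python package cloudml-hypertune and the following code:
import hypertune
# in __init__
self.hpt = hypertune.HyperTune()
# During training
self.hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag=hypeparam_metric_name,
    metric_value=value,
    global_step=step)
Then set hyperparameterMetricTag in HyperparameterSpec to the value of hypeparam_metric_name.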
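To keep the reported tag and the configured hyperparameterMetricTag from drifting apart, the reporting call can be wrapped in a small helper that stores the tag once. The following is a sketch under assumptions rather than the answer's exact code: `MetricReporter` and `_RecordingHpt` are hypothetical names introduced here, the tag `test_accuracy` is taken from the config in the question, and the real `hypertune` import is deferred so the helper can be dry-run locally without the package installed.

```python
class MetricReporter:
    """Reports one scalar per step under a single, fixed metric tag."""

    def __init__(self, metric_tag, hpt=None):
        if hpt is None:
            # Deferred import: requires `pip install cloudml-hypertune`.
            import hypertune
            hpt = hypertune.HyperTune()
        self.metric_tag = metric_tag
        self.hpt = hpt

    def report(self, value, step):
        # Same call as in the answer, with the tag held in one place.
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=self.metric_tag,
            metric_value=float(value),
            global_step=int(step),
        )


class _RecordingHpt:
    """Hypothetical stand-in for hypertune.HyperTune, for local dry runs only."""

    def __init__(self):
        self.calls = []

    def report_hyperparameter_tuning_metric(
            self, hyperparameter_metric_tag, metric_value, global_step=None):
        self.calls.append((hyperparameter_metric_tag, metric_value, global_step))


# Local dry run; inside a real training job, omit `hpt` so HyperTune() is used.
fake = _RecordingHpt()
reporter = MetricReporter("test_accuracy", hpt=fake)
reporter.report(0.91, step=1000)
print(fake.calls)
```

In the training job itself, `MetricReporter("test_accuracy")` would be created once and `report` called wherever the test accuracy is currently written to the summary writer.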