Threshold for allowed number of failed HyperDrive runs

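For "reasons", we know that when we use the azureml-sdk HyperDriveStep, we expect some of the HyperDrive runs to fail, typically around 20%. How can we handle this without failing the entire HyperDriveStep (and all downstream steps)? An example of the pipeline is below.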

I thought there would be a HyperDriveRunConfig param to allow for this, but it doesn't seem to exist. Perhaps this is controlled on the Pipeline itself with the continue_on_step_failure param?

The workaround we are considering is to catch the failed runs in our train.py script and manually log the primary_metric as zero.
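The workaround above could be sketched roughly as follows. This is only an illustration, not the asker's actual train.py: `run_trial`, `train_fn`, `log_fn`, and the metric name "accuracy" are hypothetical names. In a real Azure ML script, `log_fn` would presumably be the `log` method of `Run.get_context()` from azureml.core, and the metric name would have to match the primary metric configured for HyperDrive.

```python
import math


def run_trial(train_fn, log_fn, metric_name="accuracy", fallback=0.0):
    """Run one trial and log its metric; on failure, log a fallback instead.

    train_fn: hypothetical zero-argument callable returning the metric value.
    log_fn:   callable taking (name, value). Assumption: in Azure ML this
              would be Run.get_context().log.
    """
    try:
        value = float(train_fn())
        if math.isnan(value):
            raise ValueError("metric is NaN")
    except Exception as exc:
        # Swallow the error so the child run finishes instead of failing.
        print(f"trial failed, logging fallback metric: {exc!r}")
        value = fallback
    log_fn(metric_name, value)
    return value


if __name__ == "__main__":
    logged = {}
    run_trial(lambda: 1 / 0, logged.__setitem__)  # simulated failing trial
    print(logged)
```

A fallback of zero only makes sense when the primary metric is being maximized and zero is worse than any genuine result. For a minimized metric, a large fallback value would be needed instead.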

Thanks for your question.

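I'm assuming that HyperDriveStep is one of the steps in your Pipeline and that you want the remaining Pipeline steps to continue when HyperDriveStep fails. Is that right? Enabling continue_on_step_failure should allow the rest of the pipeline steps to continue when any single step fails.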
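A minimal sketch of enabling that option, assuming `pipeline` is an azureml.pipeline.core.Pipeline whose `submit` method accepts a `continue_on_step_failure` keyword. The helper name `submit_tolerating_step_failures` is hypothetical.

```python
def submit_tolerating_step_failures(pipeline, experiment_name):
    """Submit a pipeline so that other steps keep running if one step fails.

    Assumption: `pipeline` behaves like azureml.pipeline.core.Pipeline,
    i.e. it has submit(experiment_name, continue_on_step_failure=...).
    """
    return pipeline.submit(experiment_name, continue_on_step_failure=True)
```

As far as I understand the SDK documentation, only steps that do not depend on the output of the failed step keep running, so steps that consume the HyperDriveStep's output would probably still not execute if that step fails.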

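Also, a HyperDrive run consists of multiple child runs, which are controlled by HyperDriveConfig. If the first 3 child runs explored by HyperDrive all fail (for example, because of a user script error), the system automatically cancels the entire HyperDrive run to avoid wasting further resources.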
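To make the early-cancellation rule concrete, here is a toy model of it. It is not the service's implementation: the function name `cancelled_early` and the status string "Failed" are assumptions made purely for illustration.

```python
def cancelled_early(child_statuses, first_n=3):
    """Return True if the first `first_n` child runs all failed.

    child_statuses: list of status strings in the order the child runs
    were explored (hypothetical representation).
    """
    first = child_statuses[:first_n]
    return len(first) == first_n and all(s == "Failed" for s in first)
```

Under this model, a sweep with roughly 20% failures is only at risk when the failures happen to land on the first three child runs.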

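Do you want the other pipeline steps to continue when HyperDriveStep fails? Or do you want the other child runs within the HyperDrive run to continue when the first 3 child runs fail?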

Thanks!

azure-machine-learning-service

azureml