Azure Batch Preempted state
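I have a TVM/pool running under Azure Batch that suddenly went into the Preempted state. The problem is that it no longer accepts any requests.
I have also set a scale formula that should give me one VM whenever more than 0 pending jobs are waiting to execute in Azure Batch. Apparently that is not working either. It worked until the TVM went into the Preempted state.
How should I handle these situations?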
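AFAIK, low-priority nodes can move to the "Preempted" state depending on available capacity. Low-priority VMs are therefore best suited to certain types of workloads: use them for batch and asynchronous processing workloads where job completion time is flexible and the work is distributed across many VMs. This is the behavior defined here: https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms
I think the second half of your question is most likely also a consequence of your VM being preempted.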
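One way to react to preemption in the scale formula itself is to request dedicated nodes for as long as low-priority nodes are preempted. The Python sketch below only builds such a formula as a string. `$PreemptedNodeCount`, `$TargetDedicatedNodes`, `$TargetLowPriorityNodes` and `$NodeDeallocationOption` are variables described in the Azure Batch autoscale documentation. The function name `build_autoscale_formula`, the cap of 10 nodes and the 180-second sample window are assumptions for illustration. I have not tested this against your pool, so check it against the current docs before using it.

```python
def build_autoscale_formula(max_nodes: int, sample_window_s: int = 180) -> str:
    """Build an autoscale formula that swaps in dedicated nodes for preempted ones.

    Hypothetical helper: it only produces the formula text. Applying the
    formula to a pool (portal, CLI, or SDK) is not shown here.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    if sample_window_s < 1:
        raise ValueError("sample_window_s must be at least 1")

    lines = [
        # Upper bound on the total number of nodes in the pool.
        f"maxNumberofVMs = {max_nodes};",
        # Request dedicated nodes while low-priority nodes are reported as preempted.
        "$TargetDedicatedNodes = min(maxNumberofVMs, "
        f"$PreemptedNodeCount.GetSample({sample_window_s} * TimeInterval_Second));",
        # Fill the rest of the pool with low-priority nodes.
        "$TargetLowPriorityNodes = min(maxNumberofVMs, "
        "maxNumberofVMs - $TargetDedicatedNodes);",
        # Let running tasks finish before a node is removed.
        "$NodeDeallocationOption = taskcompletion;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_autoscale_formula(10))
```

Note that dedicated nodes are billed at the regular rate, so this trades cost for availability.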
Given the characteristics of low-priority VMs, what workloads can and cannot use them? In general, batch processing workloads are a good fit, as jobs are broken into many parallel tasks or there are many jobs that are scaled out and distributed across many VMs.
To maximize use of surplus capacity in Azure, suitable jobs can scale out.
Occasionally VMs may not be available or are preempted, which results in reduced capacity for jobs and may lead to task interruption and reruns. Jobs must therefore be flexible in the time they can take to run.
Jobs with longer tasks may be impacted more if interrupted. If long-running tasks implement checkpointing to save progress as they execute, then the impact of interruption is reduced. Tasks with shorter execution times tend to work best with low-priority VMs, because the impact of interruption is far less.
Long-running MPI jobs that utilize multiple VMs are not well suited to use low-priority VMs, because one preempted VM can lead to the whole job having to run again.
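The checkpointing advice in the quoted passage can be sketched roughly as follows. This is a minimal, generic Python example and not an Azure Batch API: the names `load_checkpoint`, `save_checkpoint` and `process_items` are hypothetical, and the local file stands in for durable storage. On a real pool the checkpoint would have to live somewhere that survives the node being preempted, which I am assuming here rather than showing.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return saved progress, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_index": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def process_items(items, path, checkpoint_every=10, stop_after=None):
    """Sum the items, resuming from the last checkpoint.

    stop_after simulates a preemption: the run ends abruptly after that
    many items without saving, and None is returned.
    """
    state = load_checkpoint(path)
    processed_this_run = 0
    for i in range(state["next_index"], len(items)):
        if stop_after is not None and processed_this_run >= stop_after:
            return None  # simulated preemption, unsaved work is lost
        state["total"] += items[i]
        state["next_index"] = i + 1
        processed_this_run += 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "progress.json")
        data = list(range(100))
        process_items(data, ckpt, stop_after=35)  # interrupted run
        print(process_items(data, ckpt))          # rerun resumes from the checkpoint
```

When the task is rerun after an interruption, it repeats only the work done since the last checkpoint instead of starting over, which is why shorter checkpoint intervals reduce the cost of a preemption.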
Hope this helps.