Unable to fail Azure DevOps Release Pipeline when Azure WebJob fails to start/run

Question

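What I'm looking for:

How can we integrate an automated check into our release pipeline that tells us whether a newly deployed WebJob has entered the Running state within 'X' amount of time?

More details:

We use Azure DevOps Release Pipelines together with the AzureRmWebAppDeployment@4 task. We are able to deploy Azure WebJobs to our staging and production environments.

Recently we discovered that our WebJob was not actually starting because of some bad code. Due to the nature of the WebJob, we could not easily spot this in staging. We deployed the bad code to production and, because of faulty alerting, only learned days later that the WebJob was not running and that our queues were heavily backed up.

The issue is that we want/need our release pipeline to report a failure whenever a WebJob fails to start. Our APIs use HealthChecks to verify that a deployment has started, is healthy, and is truly operational. We need to check the status of the WebJob during the release pipeline so that the pipeline fails and we don't assume everything is working.

In our research we found that we could potentially use Kudu, but so far we have not been able to figure out how to make it work as part of a release pipeline.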

Answer 1

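You can try this:

You can create a separate stage or do it inside your existing stage. By adding an agentless job you can add a Delay task. After that, you can call the Kudu endpoint to check the WebJob's history (as shown here). If you don't get a satisfactory response, you can simply end your script with exit 1 to fail the release.

Here is an article about accessing Kudu.

From what I found: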

Off the top, the MFA is the contract between AAD and User, when it gets to Kudu endpoint, it is just a token. It should just work, IMO.
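
The approach in this answer (wait, query Kudu, exit 1 on a bad status) could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names Get-KuduWebJobUri and Get-KuduWebJobStatus are hypothetical, the WebJob is assumed to be a continuous one exposed at Kudu's /api/continuouswebjobs/{job name} endpoint on the default *.scm.azurewebsites.net host, and the Azure CLI is assumed to be logged in already so that its access token is accepted by Kudu as a bearer token.

```powershell
# Hypothetical helper: builds the Kudu URL for a continuous WebJob.
# Assumes the app uses the default <app>.scm.azurewebsites.net host name.
function Get-KuduWebJobUri {
    param(
        [string]$AppServiceName,
        [string]$JobName
    )
    return "https://${AppServiceName}.scm.azurewebsites.net/api/continuouswebjobs/${JobName}"
}

# Hypothetical helper: asks Kudu for the job and returns its status string.
# Assumes "az login" (or an Azure CLI pipeline task) has already authenticated.
function Get-KuduWebJobStatus {
    param(
        [string]$AppServiceName,
        [string]$JobName
    )
    $token = az account get-access-token --query accessToken --output tsv
    $headers = @{ Authorization = "Bearer $token" }
    $uri = Get-KuduWebJobUri -AppServiceName $AppServiceName -JobName $JobName
    $job = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get
    return $job.status
}

# Example use inside a pipeline script (placeholder names, not run here):
# if ((Get-KuduWebJobStatus -AppServiceName 'my-app' -JobName 'my-job') -ne 'Running') {
#     Write-Host "WebJob is not running"
#     exit 1
# }
```

A single status read like this can be misleading right after a deployment, so in practice it would likely be combined with a delay or repeated a few times before deciding to fail the release.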

Answer 2

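After combining ideas from multiple sources, we came up with this solution:

Add an Azure CLI task to the desired stage of the desired Azure Release Pipeline. This task accepts either an inline PowerShell script or a path to a PowerShell script. Choose your own adventure. We chose to create a CheckWebJobStatus.ps1 containing the script below and expose it as an artifact that our Azure Release Pipelines can consume.

What this PowerShell script does, in short:
It checks the status of the target WebJob up to 10 times (configurable via $totalRuns), waits 5 seconds between checks, and looks for 3 consecutive Running status reports.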

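param(
    $resourceGroup,   # resource group that contains the App Service
    $appServiceName,  # App Service that hosts the WebJob
    $jobName,         # name of the continuous WebJob to check
    $totalRuns = 10   # maximum number of status checks
)

Write-Host "Checking status of $jobName in $resourceGroup/$appServiceName"

# Counts Running reports seen in a row; 3 in a row is treated as stable.
$consecutiveRunningStatuses = 0
if ($totalRuns -lt 3) {
    Write-Error "totalRuns must be 3 or greater"
    exit 1
}

for ($i = 0; $i -lt $totalRuns; $i++) {
    # List all continuous WebJobs of the app and parse the JSON output.
    $jobs = (az webapp webjob continuous list --name $appServiceName --resource-group $resourceGroup | ConvertFrom-Json)

    foreach ($job in $jobs) {
        # The CLI reports job names as "<appServiceName>/<jobName>".
        if ($job.name -eq "$appServiceName/$jobName") {
            if ($job.status -eq "Running") {
                Write-Host "$jobName is running! Attempt $($i + 1) of $totalRuns"
                $consecutiveRunningStatuses++

                if ($consecutiveRunningStatuses -eq 3) {
                    Write-Host "$jobName is running $consecutiveRunningStatuses times in a row! We assume that means it is stable."
                    exit 0
                }
            }
            else {
                # Any non-Running report breaks the streak.
                Write-Host "$jobName status is $($job.status). Attempt $($i + 1) of $totalRuns"
                $consecutiveRunningStatuses = 0
            }
        }
    }

    # No need to wait after the final check.
    if ($i -ne ($totalRuns - 1)) {
        Start-Sleep 5
    }
}

# A non-zero exit code makes the pipeline task, and therefore the release, fail.
Write-Host "$jobName failed to start after $totalRuns checks"
exit 1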

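Why 3 consecutive Running status reports?
Because Azure WebJobs status reporting is unreliable. When a WebJob is first deployed, it enters the Starting state and then the Running state. So far so good. However, if there is a fatal error on startup, such as a missing dependency, the job enters the Pending Restart state. In our observations, Azure then either automatically tries to start the WebJob again, or the status gets confused and is incorrectly reported as Running. The WebJob then re-enters the Pending Restart state and stays there until the next explicit attempt to deploy or start it. In our observations, we have not seen a failing WebJob stay in the Running state for more than 2 consecutive reports taken 5 seconds apart, in other words, within any 15-second window. So the script now assumes that if we receive 3 consecutive Running status reports within 15 seconds, the WebJob really is running.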
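
The "3 in a row" rule can be isolated into a small function to see how it treats a flapping job. This is a sketch for illustration only: Test-ConsecutiveRunning is a hypothetical name, and the non-Running status strings in the examples are placeholders, since the function only ever compares against "Running".

```powershell
# Hypothetical helper: returns $true once $Required "Running" reports
# have been seen back to back, otherwise $false.
function Test-ConsecutiveRunning {
    param(
        [string[]]$Statuses,
        [int]$Required = 3
    )
    $streak = 0
    foreach ($status in $Statuses) {
        if ($status -eq 'Running') {
            $streak++
            if ($streak -ge $Required) { return $true }
        }
        else {
            # Any other status resets the streak.
            $streak = 0
        }
    }
    return $false
}

# A job that crashes on startup may briefly report Running before falling back.
Test-ConsecutiveRunning -Statuses 'Starting', 'Running', 'Running', 'PendingRestart'   # False
# A healthy job keeps reporting Running.
Test-ConsecutiveRunning -Statuses 'Starting', 'Running', 'Running', 'Running'          # True
```

The threshold of 3 comes from the observations described above; a job that behaves differently might need a larger value or a longer interval between checks.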

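Aside - how we set this up:
We created a dedicated DeployTools repository with its own azure-pipelines.yaml build configuration, which only publishes the folder containing that PowerShell file. Then, in each Azure Release Pipeline where we want the check, we attach the artifact from the DeployTools build.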

azure-webjobs

azure-devops