Deploying AKS cluster with ARM template sporadically fails with PutNetworkSecurityGroupOperation error

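I am deploying an AKS cluster with an Azure (ARM) template. Most of the time the deployment succeeds. Sometimes, with the same inputs, it fails with "Operation PutNetworkSecurityGroupOperation (XXXXXXXX) was canceled and superseded by operation PutNetworkSecurityGroupOperation". The Azure template and the deployment error are included below. What could be causing this?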

Template

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "resourceGroupName": {
      "type": "string",
      "metadata": {
        "description": "The resource group name."
      }
    },
    "subscriptionId": {
      "type": "string",
      "metadata": {
        "description": "The subscription id."
      }
    },
    "region": {
      "type": "string",
      "metadata": {
        "description": "The region of AKS resource."
      }
    },
    "gbPerNode": {
      "type": "int",
      "defaultValue": 20,
      "metadata": {
        "description": "Disk size (in GB) to provision for each of the agent pool nodes. This value ranges from 0 to 1023. Specifying 0 will apply the default disk size for that agentVMSize."
      },
      "minValue": 1,
      "maxValue": 1023
    },
    "numNodes": {
      "type": "int",
      "defaultValue": 3,
      "metadata": {
        "description": "The number of agent nodes for the cluster."
      },
      "minValue": 1,
      "maxValue": 50
    },
    "machineType": {
      "type": "string",
      "defaultValue": "Standard_D2_v2",
      "metadata": {
        "description": "The size of the Virtual Machine."
      }
    },
    "servicePrincipalClientId": {
      "metadata": {
        "description": "Client ID (used by cloudprovider)"
      },
      "type": "securestring"
    },
    "servicePrincipalClientSecret": {
      "metadata": {
        "description": "The Service Principal Client Secret."
      },
      "type": "securestring"
    },
    "osType": {
      "type": "string",
      "defaultValue": "Linux",
      "allowedValues": [
        "Linux"
      ],
      "metadata": {
        "description": "The type of operating system."
      }
    },
    "kubernetesVersion": {
      "type": "string",
      "defaultValue": "1.11.5",
      "metadata": {
        "description": "The version of Kubernetes."
      }
    },
    "maxPods": {
      "type": "int",
      "defaultValue": 30,
      "metadata": {
        "description": "Maximum number of pods that can run on a node."
      }
    }
  },
  "variables": {
    "deploymentEventTopic": "deploymenteventtopic",
    "resourceGroupName": "[parameters('resourceGroupName')]",
    "omswsName": "[concat('omsws-', parameters('resourceGroupName'))]",
    "clustername": "cluster"
  },
  "resources": [
    {
      "apiVersion": "2018-03-31",
      "type": "Microsoft.ContainerService/managedClusters",
      "location": "[parameters('region')]",
      "name": "[variables('clustername')]",
      "properties": {
        "kubernetesVersion": "[parameters('kubernetesVersion')]",
        "enableRBAC": true,
        "dnsPrefix": "clust",
        "addonProfiles": {
          "httpApplicationRouting": {
            "enabled": true
          },
          "omsagent": {
            "enabled": false
          }
        },
        "agentPoolProfiles": [
          {
            "name": "agentpool",
            "osDiskSizeGB": "[parameters('gbPerNode')]",
            "count": "[parameters('numNodes')]",
            "vmSize": "[parameters('machineType')]",
            "osType": "[parameters('osType')]",
            "storageProfile": "ManagedDisks"
          }
        ],
        "servicePrincipalProfile": {
          "ClientId": "[parameters('servicePrincipalClientId')]",
          "Secret": "[parameters('servicePrincipalClientSecret')]"
        },
        "networkProfile": {
          "networkPlugin": "kubenet"
        }
      }
    }
  ]
}

Error

{
   "code":"DeploymentFailed",
   "message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.",
   "details":[
      {
         "code":"Conflict",
         "message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"Canceled\",\r\n \"message\": \"Operation was canceled.\",\r\n \"details\": [\r\n {\r\n \"code\": \"Canceled\",\r\n \"message\": \"Operation was canceled.\",\r\n \"details\": [\r\n {\r\n \"code\": \"CanceledAndSupersededDueToAnotherOperation\",\r\n \"message\": \"Operation PutNetworkSecurityGroupOperation (XXXXXXX) was canceled and superseded by operation PutNetworkSecurityGroupOperation (XXXXX).\"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"
      }
   ]
}

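The cause appears to be deploying the AKS cluster with the httpApplicationRouting add-on enabled. To work around this, I deployed the cluster from the template without httpApplicationRouting, then enabled the add-on programmatically with the Azure Java SDK once the cluster was deployed:
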
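import java.util.HashMap;
import java.util.Map;

// serviceManager is assumed to be an authenticated container service manager created elsewhere.
// Look up the cluster that the template deployment created.
final KubernetesCluster kCluster = serviceManager.kubernetesClusters()
  .getByResourceGroup(resourceGroupName, deploymentName);

// Build an add-on profile map that turns on HTTP application routing.
final Map<String, ManagedClusterAddonProfile> addonProfileMap = new HashMap<>();
addonProfileMap.put("httpApplicationRouting",
   new ManagedClusterAddonProfile().withEnabled(true));

// Apply the add-on profile to the existing cluster.
kCluster.update()
  .withAddOnProfiles(addonProfileMap)
  .apply();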
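
On the template side, the matching change is in the addonProfiles section of the managed cluster resource. A minimal sketch follows. It assumes the add-on is turned off by setting "enabled" to false, though removing the httpApplicationRouting entry altogether may work equally well; the rest of the template stays as posted in the question.

```json
"addonProfiles": {
  "httpApplicationRouting": {
    "enabled": false
  },
  "omsagent": {
    "enabled": false
  }
}
```

With this in place the template deployment no longer enables the routing add-on, and the Java snippet above turns it on as a separate update.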

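I opened a support ticket with Azure, and the support engineer confirmed that this is a bug the AKS team is working on. So if you would rather not implement the workaround, a fix should be available soon.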