Retry logic in AWS Step Functions

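I am testing the retry logic of a Step Functions state machine. In theory, the following state machine should retry the Lambda invocation 3 times if it fails.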

{
  "StartAt": "Bazinga",
  "States": {
    "Bazinga": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-southeast-2:518815385770:function:errorTest:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Retry" : [
        {
          "ErrorEquals": [ "States.All", "States.Timeout" ],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 1.0
        }
      ],
      "Next": "Fail"
    },
    "Fail": {
      "Type": "Fail"
    }
  }
}

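The Lambda it invokes is configured to time out after 3 seconds, and the handler sleeps for 4 seconds. I expected this to mean that the Lambda times out and throws a States.Timeout error. The code is as follows: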

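// Resolve after `ms` milliseconds.
function sleep(ms) {
    return new Promise(resolve => {
        setTimeout(resolve, ms)
    })
}

exports.handler = async (event) => {
    console.log('------------> executing ....')
    // Sleep longer than the 3-second Lambda timeout to force a timeout.
    await sleep(4000)
};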

The problem is that the state machine does not retry the task. The CloudWatch logs below confirm this: the function is invoked only once.


05:59:36
START RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Version: $LATEST

05:59:36
2019-07-24T05:59:36.340Z dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 INFO ------------> executing ....

05:59:39
END RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0

05:59:39
REPORT RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Duration: 3003.29 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 26 MB

05:59:39
2019-07-24T05:59:39.317Z dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Task timed out after 3.00 seconds 

I'm not sure what is going wrong. Any help is appreciated, thanks in advance.

To answer my own question, there were 2 problems with the retry logic I had written.

  1. States.All should be States.ALL (note the uppercase LL; error names are case-sensitive)
  2. When the Lambda times out, the error thrown is Lambda.Unknown, not States.Timeout
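
The first problem is easy to miss by eye, so here is a minimal sketch of a check that flags misspelled reserved error names in a Retry block. The helper name lintRetriers and the KNOWN_STATES_ERRORS list are my own for illustration, not part of any AWS SDK, and the list may be incomplete. It also checks the Amazon States Language rule that States.ALL must appear alone in ErrorEquals and only in the last retrier.

```javascript
// Predefined "States." error names. This list is an assumption on my part and
// may be incomplete, so treat an "unknown" report as a prompt to check the docs.
const KNOWN_STATES_ERRORS = new Set([
  "States.ALL",
  "States.Timeout",
  "States.HeartbeatTimeout",
  "States.TaskFailed",
  "States.Permissions",
  "States.ResultPathMatchFailure",
  "States.ParameterPathFailure",
  "States.BranchFailed",
  "States.NoChoiceMatched",
  "States.IntrinsicFailure",
  "States.DataLimitExceeded",
  "States.Runtime",
]);

// Hypothetical helper: returns a list of problems found in the Retry blocks
// of a parsed state machine definition.
function lintRetriers(definition) {
  const problems = [];
  for (const [stateName, state] of Object.entries(definition.States || {})) {
    const retriers = state.Retry || [];
    retriers.forEach((retrier, index) => {
      const names = retrier.ErrorEquals || [];
      for (const name of names) {
        // The comparison is case-sensitive, so "States.All" is not "States.ALL".
        if (name.startsWith("States.") && !KNOWN_STATES_ERRORS.has(name)) {
          problems.push(`${stateName}: unknown error name "${name}"`);
        }
      }
      if (names.includes("States.ALL")) {
        if (names.length > 1) {
          problems.push(`${stateName}: "States.ALL" must appear alone in ErrorEquals`);
        }
        if (index !== retriers.length - 1) {
          problems.push(`${stateName}: "States.ALL" must be in the last retrier`);
        }
      }
    });
  }
  return problems;
}

// The Retry block from the question, reduced to the relevant fields.
const original = {
  States: {
    Bazinga: {
      Type: "Task",
      Retry: [{ ErrorEquals: ["States.All", "States.Timeout"] }],
    },
  },
};

console.log(lintRetriers(original));
```

Note that simply correcting the case in the original block would still leave it invalid, because States.ALL would then share the array with States.Timeout.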

I updated my state machine with the following definition, and now it works:

{
  "StartAt": "Bazinga",
  "States": {
    "Bazinga": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-southeast-2:518815385770:function:errorTest:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Retry" : [
        {
          "ErrorEquals": [ "States.Timeout", "Lambda.Unknown" ],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 1.0
        }
      ],
      "Next": "Fail"
    },
    "Fail": {
      "Type": "Fail"
    }
  }
}
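
To make the retry settings above concrete, here is a small sketch that computes the wait before each retry. It assumes the usual Amazon States Language rule that the wait before retry n is IntervalSeconds multiplied by BackoffRate to the power n - 1, with defaults of 1, 3 and 2.0 for IntervalSeconds, MaxAttempts and BackoffRate. The function name retryDelays is my own.

```javascript
// Hypothetical helper: seconds to wait before each retry of a single retrier.
// Defaults follow the Amazon States Language defaults for these fields.
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2.0 } = {}) {
  const delays = [];
  for (let n = 0; n < MaxAttempts; n++) {
    // The first retry waits IntervalSeconds; each later one is scaled by BackoffRate.
    delays.push(IntervalSeconds * Math.pow(BackoffRate, n));
  }
  return delays;
}

// Settings from the state machine above: three retries, one second apart.
const delays = retryDelays({ IntervalSeconds: 1, MaxAttempts: 3, BackoffRate: 1.0 });
console.log(delays); // [ 1, 1, 1 ]

// MaxAttempts counts retries, so the Lambda is invoked once plus MaxAttempts times.
console.log(1 + delays.length); // 4
```

With these settings a task that always times out should show four START lines in CloudWatch, not three.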

This happens because you did not define TimeoutSeconds in the ASL (Amazon States Language) definition. Example:

"Type": "Task",
"Resource": "${FunctionArn}",
"TimeoutSeconds": 3,

Otherwise, the failure is reported as Lambda.Unknown.
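
As a rough sketch of applying this to an existing definition, the helper below returns a copy of a parsed state machine with TimeoutSeconds set on one Task state. The name withTaskTimeout is hypothetical, and the 3-second value simply mirrors the example above. I am assuming that States.Timeout is raised when the state's own TimeoutSeconds elapses; which timeout fires first when it equals the Lambda's timeout is something I have not verified.

```javascript
// Hypothetical helper: returns a new definition in which the named Task state
// has TimeoutSeconds set. The input object is not modified.
function withTaskTimeout(definition, stateName, seconds) {
  const state = definition.States[stateName];
  if (!state || state.Type !== "Task") {
    throw new Error(`"${stateName}" is not a Task state`);
  }
  return {
    ...definition,
    States: {
      ...definition.States,
      [stateName]: { ...state, TimeoutSeconds: seconds },
    },
  };
}

// A reduced version of the state machine from the question.
const definition = {
  StartAt: "Bazinga",
  States: {
    Bazinga: {
      Type: "Task",
      Resource: "arn:aws:states:::lambda:invoke",
      Retry: [{ ErrorEquals: ["States.Timeout", "Lambda.Unknown"], MaxAttempts: 3 }],
      Next: "Fail",
    },
    Fail: { Type: "Fail" },
  },
};

const updated = withTaskTimeout(definition, "Bazinga", 3);
console.log(JSON.stringify(updated, null, 2));
```

Keeping Lambda.Unknown in ErrorEquals alongside States.Timeout covers both cases described in the two answers.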