Managing EventBridge -> Lambda async retry behaviour

I'm trying to manage Lambda retries in the case where EventBridge invokes a Lambda function asynchronously via an event rule (see the template at the bottom).

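I have tried to configure retry behaviour on both the EventBridge and Lambda sides. Specifically, as the template shows, the rule target has a RetryPolicy with MaximumRetryAttempts: 0 plus a DeadLetterConfig, and the function has an EventInvokeConfig with MaximumRetryAttempts: 0 plus an OnFailure destination.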

I can push a "good" message to EventBridge -

{'action': 'add', 'args': {'x': 2, 'y': 2}}

This is received by Lambda -

[INFO]  2021-11-19T06:56:25.242Z    590c6514-ad4d-4906-a748-9820af748e76    received: {'version': '0', 'id': '62f363a1-9e0e-a154-8d6a-bce81d22d47f', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:56:24Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 2, 'y': 2}}}
[INFO]  2021-11-19T06:56:25.242Z    590c6514-ad4d-4906-a748-9820af748e76    result: 4

I can also send a "bad" message to EventBridge -

{'action': 'add', 'args': {'x': 1, 'y': 'a'}}

This causes a Lambda error -

[INFO]  2021-11-19T06:50:49.603Z    b25129f4-d89a-493c-b85e-7ffaef995c71    received: {'version': '0', 'id': '8bb8b3d2-3725-8a24-19ea-547a6a8b799d', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:47:53Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'x'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]

So far so good, but the problem is that I still get the standard Lambda retry behaviour at roughly T+60 and T+180 seconds, which causes further errors -

[INFO]  2021-11-19T06:52:46.142Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]
[INFO]  2021-11-19T06:53:50.326Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]
[INFO]  2021-11-19T06:55:59.477Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]

And the offending event never ends up in either the events DLQ or the Lambda destination.

What am I missing here, and what do I need to do to turn off these retries and get the event to show up in the DLQ/destination?

(For good measure, should error handling/retries be configured on the EventBridge side or on the Lambda side? Surely I don't need both?)


AWSTemplateFormatVersion: '2010-09-09'
Outputs:
  MyEventBus:
    Value:
      Ref: MyEventBus
  MyEventsDLQ:
    Value:
      Ref: MyEventsDLQ
  MyFunctionDestination:
    Value:
      Ref: MyFunctionDestination
Parameters:
  LambdaHandlerName:
    Default: "index.handler"
    Type: String
  LambdaSize:
    Default: 512
    Type: Number
  LambdaRuntime:
    Default: 'python3.8'
    Type: String
  LambdaTimeout:
    Default: 5
    Type: Number
Resources:
  MyFunction:
    Properties:
      Code:
       ZipFile: |
         import logging
         logger=logging.getLogger()
         logger.setLevel(logging.INFO)
         def handler(event, context):
           logger.info("received: %s" % event)
           args=event["detail"]["args"]
           result=args["x"]+args["y"]
           logger.info("result: %s" % result)
      Handler:
        Ref: LambdaHandlerName
      MemorySize:
        Ref: LambdaSize
      Role:
        Fn::GetAtt:
        - MyFunctionRole
        - Arn
      Runtime:
        Ref: LambdaRuntime
      Timeout:
        Ref: LambdaTimeout
    Type: AWS::Lambda::Function
  MyFunctionRole:
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Action: sts:AssumeRole
          Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
        Version: '2012-10-17'
      Policies:
      - PolicyDocument:
          Statement:
          - Action: logs:*
            Effect: Allow
            Resource: '*'
          - Action: sqs:*
            Effect: Allow
            Resource: '*'
          Version: '2012-10-17'
        PolicyName:
          Fn::Sub: my-function-role-policy-${AWS::StackName}
    Type: AWS::IAM::Role
  MyEventsFunctionPermission:
    Properties:
      Action: lambda:InvokeFunction
      FunctionName:
        Ref: MyFunction
      Principal: events.amazonaws.com
      SourceArn:
        Fn::GetAtt:
        - MyEventRule
        - Arn
    Type: AWS::Lambda::Permission
  MyEventRule:
    Properties:
      EventBusName:
        Ref: MyEventBus
      EventPattern:
        detail:
          action:
            - add
      State: ENABLED
      Targets:
      - Arn:
          Fn::GetAtt:
          - MyFunction
          - Arn
        Id:
          Fn::Sub: my-rule-${AWS::StackName}
        RetryPolicy:
          MaximumRetryAttempts: 0
        DeadLetterConfig:
          Arn:
            Fn::GetAtt:
              - MyEventsDLQ
              - Arn
    Type: AWS::Events::Rule
  MyEventBus:
    Properties:
      Name:
        Fn::Sub: my-event-bus-${AWS::StackName}
    Type: AWS::Events::EventBus
  MyEventsDLQ:
    Properties: {}
    Type: AWS::SQS::Queue
  MyEventsDLQPolicy:
    Properties:
      Queues:
        - Ref: MyEventsDLQ
      PolicyDocument:
        Statement:
          - Action: sqs:SendMessage
            Effect: Allow
            Principal:
              Service: events.amazonaws.com
    Type: AWS::SQS::QueuePolicy
  MyFunctionDestination:
    Properties: {}
    Type: AWS::SQS::Queue
  MyFunctionEventConfig:
    Properties:
      DestinationConfig:
        OnFailure:
          Destination:
            Fn::GetAtt:
            - MyFunctionDestination
            - Arn
      FunctionName:
        Ref: MyFunction
      MaximumRetryAttempts: 0
      Qualifier:
        Fn::GetAtt:
        - MyFunctionVersion
        - Version
    Type: AWS::Lambda::EventInvokeConfig
  MyFunctionVersion:
    Properties:
      FunctionName:
        Ref: MyFunction
    Type: AWS::Lambda::Version

Try setting Qualifier: $LATEST on MyFunctionEventConfig.

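The behaviour you observe is consistent with the MyFunctionEventConfig destination never being invoked at all. I suspect this is because you have qualified the config with the newly created Lambda version, MyFunctionVersion, but I don't believe you are invoking that version: the rule target uses the unqualified function ARN, which resolves to $LATEST. So the destination is never invoked either. The same would apply to the MaximumRetryAttempts: 0 setting in that config, which would explain why you still see Lambda's default two asynchronous retries.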

Unless your AWS::Lambda::Version is doing something useful for you, you can remove it and use Qualifier: $LATEST.
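A minimal sketch of that change against the template in the question. It is untested and assumes that the rule target keeps pointing at the unqualified function ARN and that MyFunctionVersion is deleted because nothing else references it. Only the Qualifier value differs from the original resource:

```yaml
  MyFunctionEventConfig:
    Properties:
      DestinationConfig:
        OnFailure:
          Destination:
            Fn::GetAtt:
            - MyFunctionDestination
            - Arn
      FunctionName:
        Ref: MyFunction
      MaximumRetryAttempts: 0
      # Attach the config to $LATEST, which is what an unqualified
      # function ARN (as used by the rule target) invokes.
      Qualifier: '$LATEST'
    Type: AWS::Lambda::EventInvokeConfig
```

With this in place the MyFunctionVersion resource can be dropped from the Resources section.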

Edit - more info:

Triggers and destinations are version-specific, because each Lambda version has its own ARN.

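You can test this in the Lambda console without redeploying. If the version hypothesis is correct, the destination will not appear in the "Function overview" section of the Lambda console unless you first select the published version.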