AWS Aurora Serverless - Communication Link Failure

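I'm using a MySQL Aurora Serverless cluster (with the Data API enabled) from my Python code, and I'm getting a communications link failure exception. It usually happens after the cluster has been dormant for some time.

Once the cluster is active, I no longer get any error. Each time, I have to send 3-4 requests before it starts working.

Exception details: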

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. An error occurred (BadRequestException) when calling the ExecuteStatement operation: Communications link failure

How can I solve this? I'm using the standard boto3 library.

Here is the reply from AWS Premium Business Support.

Summary: It is an expected behavior

Detailed answer:

I can see that you receive this error when your Aurora Serverless instance is inactive, and that you stop receiving it once the instance is active and accepting connections. Please note that this is expected behavior. In general, Aurora Serverless works differently from Provisioned Aurora: while the cluster is "dormant" it has no compute resources assigned to it, and compute resources are assigned only when a DB connection is received. Because of this behavior, you will have to "wake up" the cluster, and it may take a few minutes for the first connection to succeed, as you have seen.

To avoid that, you may consider increasing the timeout on the client side. Also, if you have enabled Pause, you may consider disabling it [2]. After disabling Pause, you can also adjust the minimum Aurora capacity unit (ACU) to a higher value to make sure that your cluster always has enough computing resources to serve new connections [3]. Please note that adjusting the minimum ACU might increase the cost of the service [4].

Also note that Aurora Serverless is only recommended for certain workloads [5]. If your workload is highly predictable and your application needs to access the DB on a regular basis, I would recommend you use a Provisioned Aurora cluster/instance to ensure high availability for your business.

[2] How Aurora Serverless works - Automatic pause and resume for Aurora Serverless - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume

[3] Setting the capacity of an Aurora Serverless DB cluster - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.setting-capacity.html

[4] Aurora Serverless pricing - https://aws.amazon.com/rds/aurora/serverless/

[5] Using Amazon Aurora Serverless - Use cases for Aurora Serverless - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.html#aurora-serverless.use-cases
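
For anyone who wants to apply the "disable Pause and raise the minimum ACU" suggestion from code rather than the console, here is a minimal sketch. It assumes an Aurora Serverless v1 cluster and a boto3 RDS client (`boto3.client('rds')`) passed in by the caller; the function names, the cluster identifier and the capacity values are my own placeholders, not something from the support reply.

```python
def build_scaling_update(cluster_id, min_capacity, max_capacity):
    """Build the keyword arguments for the RDS ModifyDBCluster call.

    cluster_id and the capacity values are placeholders chosen by the caller.
    """
    return {
        "DBClusterIdentifier": cluster_id,
        "ScalingConfiguration": {
            "AutoPause": False,           # keep compute assigned, so no cold start
            "MinCapacity": min_capacity,  # minimum ACUs (a higher value costs more)
            "MaxCapacity": max_capacity,
        },
    }


def disable_auto_pause(rds_client, cluster_id, min_capacity, max_capacity):
    """Apply the scaling update; rds_client is assumed to be boto3.client('rds')."""
    return rds_client.modify_db_cluster(
        **build_scaling_update(cluster_id, min_capacity, max_capacity)
    )
```

Keep in mind that with Pause disabled the cluster is billed continuously, which is the cost trade-off mentioned in [4].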

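In case it is useful to someone, this is how I manage retries while Aurora Serverless is waking up.

The client returns a BadRequestException, so boto3 will not retry even if you change the client's retry configuration; see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

My first option was to try a Waiter, but RDSData doesn't have any waiters. I then tried to create a custom Waiter with an error matcher, but that only matches on the error code and ignores the message. Since a BadRequestException can also be caused by an error in the SQL statement, I needed to check the message as well, so I used a kind of waiter function: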

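import logging
import time

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# Assumed setup: a Data API client, with DB_NAME, CLUSTER_ARN and SECRET_ARN
# defined elsewhere in the module.
rds_data = boto3.client('rds-data')


def _wait_for_serverless():
    """Poll the Data API until the cluster has woken up, or give up."""
    delay = 5  # seconds to sleep between attempts
    max_attempts = 10

    attempt = 0
    while attempt < max_attempts:
        attempt += 1

        try:
            # Probe query; execute_statement takes the SQL text as `sql`
            rds_data.execute_statement(
                database=DB_NAME,
                resourceArn=CLUSTER_ARN,
                secretArn=SECRET_ARN,
                sql='SELECT * FROM dummy'
            )
            return
        except ClientError as ce:
            error_code = ce.response.get("Error").get('Code')
            error_msg = ce.response.get("Error").get('Message')

            # Aurora serverless is waking up
            if error_code == 'BadRequestException' and 'Communications link failure' in error_msg:
                logger.info('Sleeping ' + str(delay) + ' secs, waiting RDS connection')
                time.sleep(delay)
            else:
                # Any other error (e.g. bad SQL) is not a wake-up problem
                raise

    raise Exception('Waited for RDS Data but still getting error')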

And I use it like this:

def begin_rds_transaction():
    _wait_for_serverless()

    return rds_data.begin_transaction(
        database=DB_NAME,
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN
    )

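I also ran into this issue and, taking inspiration from the solution used by Arless and the conversation with Jimbo, came up with the following workaround.

I defined a decorator that retries the serverless RDS request until the configurable retry duration has expired.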

import logging
import functools
from sqlalchemy import exc
import time

logger = logging.getLogger()


def retry_if_db_inactive(max_attempts, initial_interval, backoff_rate):
    """
    Retry the function if the serverless DB is still in the process of 'waking up'.
    The retry configuration follows the same concepts as AWS Step Functions retries.
    :param max_attempts: The maximum number of retry attempts
    :param initial_interval: The initial duration to wait (in seconds) when the first 'Communications link failure' error is encountered
    :param backoff_rate: The factor to use to multiply the previous interval duration, for the next interval
    :return:
    """

    def decorate_retry_if_db_inactive(func):

        @functools.wraps(func)
        def wrapper_retry_if_inactive(*args, **kwargs):
            interval_secs = initial_interval
            attempt = 0
            while attempt < max_attempts:
                attempt += 1
                try:
                    return func(*args, **kwargs)

                except exc.StatementError as err:
                    if hasattr(err.orig, 'response'):
                        error_code = err.orig.response["Error"]['Code']
                        error_msg = err.orig.response["Error"]['Message']

                        # Aurora serverless is waking up
                        if error_code == 'BadRequestException' and 'Communications link failure' in error_msg:
                            logger.info('Sleeping for ' + str(interval_secs) + ' secs, awaiting RDS connection')
                            time.sleep(interval_secs)
                            interval_secs = interval_secs * backoff_rate
                        else:
                            raise err
                    else:
                        raise err

            raise Exception('Waited for RDS Data but still getting error')

        return wrapper_retry_if_inactive

    return decorate_retry_if_db_inactive

It can then be used like this:

@retry_if_db_inactive(max_attempts=4, initial_interval=10, backoff_rate=2)
def insert_alert_to_db(sqs_alert):
    with db_session_scope() as session:
        # your db code
        session.add(sqs_alert)

    return None

Please note that I'm using sqlalchemy, so the code will need tweaking to suit other setups, but hopefully it's useful as a starting point.
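
As a rough idea of what that tweaking could look like for plain boto3 (no sqlalchemy), here is a sketch of the same decorator that inspects the `response` dict botocore's `ClientError` carries. The names `is_waking_up` and `retry_while_waking` and the injectable `sleep` parameter are my own additions for illustration; I have not run this against a live cluster.

```python
import functools
import time


def is_waking_up(err):
    """Return True if err looks like the Data API 'cluster is resuming' error."""
    response = getattr(err, "response", None)
    if not isinstance(response, dict):
        return False
    error = response.get("Error", {})
    return (error.get("Code") == "BadRequestException"
            and "Communications link failure" in error.get("Message", ""))


def retry_while_waking(max_attempts, initial_interval, backoff_rate, sleep=time.sleep):
    """Retry the wrapped call with exponential backoff while the DB wakes up.

    sleep is injectable (assumption for this sketch) so the delay can be stubbed.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            interval = initial_interval
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as err:
                    # Re-raise unrelated errors at once, and the wake-up
                    # error too once the final attempt has failed.
                    if not is_waking_up(err) or attempt == max_attempts:
                        raise
                    sleep(interval)
                    interval *= backoff_rate
        return wrapper
    return decorate


# Hypothetical usage with the Data API client from the earlier answer:
# @retry_while_waking(max_attempts=4, initial_interval=10, backoff_rate=2)
# def run_query(sql):
#     return rds_data.execute_statement(
#         database=DB_NAME, resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, sql=sql)
```

Unlike the version above, this one re-raises the original error after the last attempt instead of sleeping once more and raising a generic Exception, so the caller still sees the real cause.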