AWS ElastiCache Redis: what happens when a Redis (Cluster Mode Disabled) group has only 1 node and it fails?

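I have read about the automatic failover feature of AWS ElastiCache Redis. The documentation says that failover requires at least 1 replica node (that is, at least 2 nodes in total), so that a replica can replace the failed primary node.

But I can't find any details on what happens if I have only 1 node and it fails. Is it recreated automatically, or do I need to delete and recreate it manually?

I plan to use the following CloudFormation template to create a Redis group with only 1 node (Cluster Mode Disabled) in my test environment.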

    "ReplicationGroup": {
        "Type": "AWS::ElastiCache::ReplicationGroup",
        "Properties": {
            "ReplicationGroupId" : "my-redis",
            "ReplicationGroupDescription" : "My Redis",
            "NumCacheClusters": 1,
            "AutomaticFailoverEnabled": false,
            "CacheNodeType": "cache.t3.medium",
            "CacheParameterGroupName" : "default.redis5.0",
            "Engine": "redis",
            "EngineVersion" : "5.0.6",
            "Port": "6379",
            "AtRestEncryptionEnabled" : true,
            "TransitEncryptionEnabled" : true,
            "AuthToken" : {"Ref": "AuthToken"},
            "CacheSubnetGroupName": {"Ref": "SubnetGroup"},
            "SecurityGroupIds": [
                {"Ref": "RedisSecurityGroup"}
            ],
            "SnapshotRetentionLimit": 0,
            "MultiAZEnabled" : {"Fn::If": ["ConditionMultiAZEnabled", true, false]}
        }
    },

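It depends on the kind of failure.

A single node lives in one Availability Zone, so if that AZ has a problem, your node may be affected and there is little you can do about it. If you want to restore access, you need to create another node in a different AZ.

If it is an underlying host failure (for example, a rack loses power or the physical server needs a reboot), AWS will try to migrate the node to another host in the same Availability Zone.

Most managed services follow the same recovery process as EC2 hosts, because EC2 is what these services run on behind the scenes.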

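We have run into this before. When AWS tried to install an important security update, we lost all our data (which did not comply with the service update SLA). It was a single-node ElastiCache instance. Here is the reply from AWS Support with all the details: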

As you said, I found there were event messages on the cluster and BytesUsedForCache was dropped to 0. When I investigated the redis node, I was able to see that health check from ElastiCache service was failed since hardware failure and the node ***** was replaced to healthy new node to recover the redis service. Due to the redis cluster ***** has only single node *****, data loss can happen whenever the node is failed like this case.

To improve high availability to the redis cluster and keep your data in node failure case, you should make a replication group by adding at least a replica node to the cluster. Please read this link to understand replication group in detail. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.html

Replica node can be used for only read request, but data is always replicated from primary node to replica node. Also replica node can be promoted to new primary when primary is failed, and then you can protect your data. This link provides how to add replica node. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.AddReadReplica.html

Furthermore, you can also enable Multi-AZ with auto failover feature with replication group. It can failover primary node automatically when the primary node is failed. It can also improve High Availability of your redis cluster. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html
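
For reference, here is a rough sketch of how the template from the question might look if it followed Support's advice: one primary plus one replica, with automatic failover and Multi-AZ turned on. It reuses the question's resource names (`AuthToken`, `SubnetGroup`, `RedisSecurityGroup`) and assumes the subnet group covers at least two Availability Zones and that the chosen node type and engine version support automatic failover, so treat it as a starting point rather than a verified template.

```json
{
    "ReplicationGroup": {
        "Type": "AWS::ElastiCache::ReplicationGroup",
        "Properties": {
            "ReplicationGroupId": "my-redis",
            "ReplicationGroupDescription": "My Redis",
            "NumCacheClusters": 2,
            "AutomaticFailoverEnabled": true,
            "MultiAZEnabled": true,
            "CacheNodeType": "cache.t3.medium",
            "CacheParameterGroupName": "default.redis5.0",
            "Engine": "redis",
            "EngineVersion": "5.0.6",
            "Port": 6379,
            "AtRestEncryptionEnabled": true,
            "TransitEncryptionEnabled": true,
            "AuthToken": {"Ref": "AuthToken"},
            "CacheSubnetGroupName": {"Ref": "SubnetGroup"},
            "SecurityGroupIds": [
                {"Ref": "RedisSecurityGroup"}
            ]
        }
    }
}
```

The only changes from the single-node version are `NumCacheClusters` (1 to 2), `AutomaticFailoverEnabled` (false to true) and a fixed `MultiAZEnabled: true`. With a replica in place, a primary failure should promote the replica instead of replacing the node with an empty one.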