Cassandra Cluster in AWS Multi Region VPC

I am trying to implement the following architecture for my Cassandra cluster:

So far I have been able to configure the cluster, install OpsCenter, and check that every agent is working. (For reference, I used GossipPropertyFileSnitch and put "dc=us-west, rack=1b" in the rack configuration.)

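My problem is that my HTTP API is slow and times out far too often. I have been trying to run some import scripts (HTTP to the CQL driver, inserting into Cassandra) and keep getting errors like this: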

Error while executing batch: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

For reference, the corresponding error in system.log is:

ERROR [SharedPool-Worker-1] 2015-03-04 19:25:39,598 ErrorMessage.java:243 - Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache.get(LocalCache.java:3934) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4821) ~[guava-16.0.jar:na]
at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:56) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.Auth.getPermissions(Auth.java:78) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.ClientState.authorize(ClientState.java:352) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:250) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:244) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:228) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:128) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.cql3.statements.BatchStatement.checkAccess(BatchStatement.java:86) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.cql3.QueryProcessor.processBatch(QueryProcessor.java:500) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.transport.messages.BatchMessage.execute(BatchMessage.java:215) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335) [apache-cassandra-2.1.3.jar:2.1.3]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.access0(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [na:1.8.0_31]
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) [apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
at java.lang.Thread.run(Unknown Source) [na:1.8.0_31]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:279) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:100) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:67) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.PermissionsCache.load(PermissionsCache.java:82) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.PermissionsCache.load(PermissionsCache.java:79) ~[apache-cassandra-2.1.3.jar:2.1.3]
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2317) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2280) ~[guava-16.0.jar:na]
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2195) ~[guava-16.0.jar:na]
... 23 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:103) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:139) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1338) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1265) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1188) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:253) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:206) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:268) ~[apache-cassandra-2.1.3.jar:2.1.3]
... 32 common frames omitted

It does work sometimes, and I was even able to connect with DevCenter and actually see my data. But there are far too many failures.

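My temporary workaround was to enable communication on each instance's public IP while still having the nodes work together over their private IPs. The import is running now.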

Now I am still wondering:

Thanks for your help.

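Personally, I don't think this solution is workable, for several reasons.

  1. There will be huge latency between regions. Imagine that all the data you want to store in the cluster has to be replicated over the Internet, through a VPN or with SSL encryption/decryption, depending on the approach you choose. I assume you chose Cassandra because you plan to have a lot of data.
  2. You will pay a high price, because the gossip protocol is very chatty and all of your data will travel back and forth between endpoints many times. For every GB sent from one node to another, you pay $0.02.
  3. Unless you increase all the relevant timeout values in cassandra.yaml, you will keep hitting timeouts, and if you do increase them, everything will be slow.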
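To put the second point in numbers, here is a rough back-of-the-envelope sketch. It assumes the $0.02 per GB figure quoted above (check current AWS pricing) and that every written GB is shipped once to each replica living in another region; the function name and the sample volumes are hypothetical. It ignores gossip, repair, and cross-region reads, so treat the result as a lower bound.

```python
# Rough lower bound on inter-region transfer cost for replicated writes.
# PRICE_PER_GB_USD is the figure quoted in the answer, not a verified current price.
PRICE_PER_GB_USD = 0.02


def cross_region_cost_usd(gb_written: float,
                          remote_replicas: int,
                          price_per_gb: float = PRICE_PER_GB_USD) -> float:
    """Cost of sending each written GB once to every replica in another region."""
    if gb_written < 0 or remote_replicas < 0:
        raise ValueError("gb_written and remote_replicas must be non-negative")
    return gb_written * remote_replicas * price_per_gb


if __name__ == "__main__":
    # Hypothetical workload: 500 GB written per month, 2 replicas in the other region.
    print(f"${cross_region_cost_usd(500, 2):.2f} per month")
```

Replication traffic scales linearly with both the write volume and the number of remote replicas, which is why keeping replicas for a given request local matters.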

You can use SSL for node-to-node traffic; here is the detail.

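I am not 100% sure about the cause of the timeouts, but there is a strong indication that they come from a node not receiving a response from the other nodes within the timeout value: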

Operation timed out - received only 0 responses.
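One quick way to test that suspicion is to check whether each node can open a TCP connection to its peers on the inter-node port at all. The sketch below uses only the Python standard library and assumes the default `storage_port` of 7000 (7001 is the default when inter-node SSL is enabled); the helper name and the peer addresses in the usage comment are hypothetical.

```python
import socket

# Cassandra's default storage_port; adjust if cassandra.yaml overrides it.
STORAGE_PORT = 7000


def can_reach(host: str, port: int = STORAGE_PORT, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Usage (hypothetical private IPs, run from each node in turn):
#   for peer in ["10.0.1.10", "10.1.1.10"]:
#       print(peer, "reachable" if can_reach(peer) else "NOT reachable")
```

A successful connection only shows that routing and security groups allow the traffic; it says nothing about latency, which can still push reads past the timeout.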

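I would suggest setting up a multi-datacenter cluster, with one datacenter in the same region and another datacenter in the other region. That way your application talks to a set of local nodes, and the data is then replicated to the nodes in the remote datacenter. Cassandra has ways to reduce the traffic between multi-region datacenters.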
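As a minimal sketch of what that setup looks like on the schema side, the helper below builds a `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy` with a replication factor per datacenter, and computes how many local replicas a `LOCAL_QUORUM` request waits for. The datacenter name `us-west` comes from the question; `us-east`, the keyspace name `app`, and both function names are assumptions for illustration. The datacenter names must match what the snitch reports.

```python
def network_topology_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replication factor per datacenter."""
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def local_quorum(rf_in_local_dc: int) -> int:
    """Replicas in the local datacenter that must answer a LOCAL_QUORUM request."""
    return rf_in_local_dc // 2 + 1


if __name__ == "__main__":
    # Hypothetical layout: three replicas in each of two datacenters.
    print(network_topology_keyspace_cql("app", {"us-west": 3, "us-east": 3}))
    print("LOCAL_QUORUM waits for", local_quorum(3), "local replicas")
```

With a layout like this, clients that read and write at `LOCAL_QUORUM` only wait for replicas in their own datacenter, while replication to the other region happens without blocking the request.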

Here is a great slide presentation on multi-region datacenters. It also has some useful information that I haven't covered here.