Can Cassandra Compaction Lead To Timeouts?
I am using Titan 1.0.0 to bulk-load a large dataset (35M vertices) into a single-node Cassandra instance as quickly as possible. During this process a cleanup routine fires periodically, which modifies certain properties on x vertices, where 10000 < x <= 500000. I make sure that each transaction affects exactly 100 vertices.
Initially this process works, but once my graph develops a few supernodes I start seeing the following exception:
com.thinkaurelius.titan.diskstorage.TemporaryBackendException:
Caused by: com.netflix.astyanax.connectionpool.exceptions.OperationTimeoutException: OperationTimeoutException: [host=172.18.02(172.18.0.2):9160, latency=4031(4031), attempts=1]TimedOutException(acknowledged_by:0, acknowledged_by_batchlog:true)
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:171) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:153) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:517) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:93) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.execute(ThriftKeyspaceImpl.java:137) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxStoreManager.mutateMany(AstyanaxStoreManager.java:389) ~[titan-cassandra-1.0.0.jar!/:na]
... 22 common frames omitted
Caused by: org.apache.cassandra.thrift.TimedOutException: null
at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29624) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29592) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result.read(Cassandra.java:29526) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.9.2.jar!/:0.9.2]
at org.apache.cassandra.thrift.Cassandra$Client.recv_atomic_batch_mutate(Cassandra.java:1108) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
at org.apache.cassandra.thrift.Cassandra$Client.atomic_batch_mutate(Cassandra.java:1094) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.internalExecute(ThriftKeyspaceImpl.java:147) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.internalExecute(ThriftKeyspaceImpl.java:141) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
... 30 common frames omitted
I noticed that whenever this happens, Cassandra is busy running large compaction jobs:
WARN 09:43:59 Compacting large partition test/edgestore:0000000000357200 (252188100 bytes)
WARN 09:47:03 Compacting large partition test/edgestore:6800000000365a80 (1417482764 bytes)
WARN 09:48:37 Compacting large partition test/edgestore:0000000000002480 (127497758 bytes)
WARN 09:51:58 Compacting large partition test/edgestore:6000000000376d00 (227606217 bytes)
WARN 09:54:35 Compacting large partition test/edgestore:d000000000002b00 (124082466 bytes)
WARN 09:58:24 Compacting large partition test/edgestore:6800000000354380 (172991088 bytes)
So the question is simple: can Cassandra compaction lead to the above timeouts, and if so, what is the best approach for handling this?
Yes, absolutely. The root cause is probably not compaction itself, but saturated I/O bandwidth. Here is one possible chain of events:
- heavy compaction
- disk I/O cannot keep up
- data stays in memory longer
- data is promoted to the JVM old generation
- stop-the-world garbage collection kicks in
- the node is detected as down by the other nodes
The first things to check are to grep for the keyword "GC" in your /var/log/cassandra/system.log, and to monitor I/O and CPU iowait with the dstat tool.
Also, what JVM heap size have you configured?
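The checks above can be sketched as shell commands. This is a minimal sketch, assuming a typical package-based Cassandra install; the log path and the availability of dstat and nodetool on the node are assumptions, so adjust to your environment:

```shell
# Assumed default log location for a package-based install
LOG=/var/log/cassandra/system.log

# 1. Stop-the-world GC pauses typically show up as GCInspector lines
[ -f "$LOG" ] && grep -i "GC" "$LOG" | tail -n 20

# 2. Watch CPU (user/system/iowait) and disk throughput, 3 samples, 5s apart
command -v dstat >/dev/null && dstat -c -d 5 3

# 3. Check whether large compactions are currently running
command -v nodetool >/dev/null && nodetool compactionstats

# 4. Inspect the heap size the Cassandra JVM was started with (-Xmx flag)
ps -ef | grep -i '[c]assandra' | grep -o '\-Xmx[^ ]*' || true
```

If iowait climbs while compactionstats shows long-running jobs and the log fills with GCInspector pause warnings, that matches the saturation chain described above.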