Cassandra Cluster - Specific Node - Specific Table High Dropped Mutations
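My compression strategy in production was LZ4 compression, but I changed it to Deflate.
For the compression change, we had to use nodetool upgradesstables to force all SSTables to be rewritten with the new compression strategy.
But once the upgradesstables command completed on all 5 nodes in the cluster, my requests started to fail, both reads and writes.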
The issue has been traced to a specific node in the 5-node cluster and
to a specific table on that node. The whole cluster has roughly the same
amount of data and the same configuration on every node, but one node in
particular is misbehaving.
Output of nodetool status:
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN xx.xxx.xx.xxx 283.94 GiB 256 40.4% 24950207-5fbc-4ea6-92aa-d09f37e83a1c rack1
UN xx.xxx.xx.xxx 280.55 GiB 256 39.9% 4ecdf7f8-a4d8-4a94-a930-1a87a80ae510 rack1
UN xx.xxx.xx.xxx 284.61 GiB 256 40.5% de2ada08-264b-421a-961f-5fd113f28208 rack1
UN YY.YYY.YY.YYY 280.44 GiB 256 40.2% 68c7c130-6cf8-4864-bde8-1819f238045c rack2
UN xx.xxx.xx.xxx 273.71 GiB 256 39.0% 6c080e47-ffb2-4fbc-bc7e-73df19103d2a rack2
The YY.YYY.YY.YYY node above is the one with errors.
Cluster Configuration
- Replication factor -> 2
- Read consistency -> 1
- Write consistency -> 1
- FYI, I am also using lightweight transactions
- Cassandra version -> 3.10
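As a quick cross-check of the figures above: in a single-datacenter cluster, the effective ownership percentages reported by nodetool status are expected to add up to roughly the replication factor times 100%. The following is a minimal Python sketch under that assumption, using the Owns values copied from the output above; the variable names are illustrative only.

```python
# Effective ownership per node, copied from the nodetool status output above.
owns_percent = [40.4, 39.9, 40.5, 40.2, 39.0]

# Round to one decimal to absorb floating-point noise in the sum.
total = round(sum(owns_percent), 1)

# Assumption: total effective ownership is about RF * 100%.
implied_rf = round(total / 100, 2)

print(f"total effective ownership: {total}%")
print(f"implied replication factor: {implied_rf}")
```

The values sum to 200%, which is consistent with a replication factor of 2 spread fairly evenly across the five nodes.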
nodetool tablestats for that specific table shows high dropped mutations:
SSTable count: 11
Space used (live): 9.82 GiB
Space used (total): 9.82 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 26.77 MiB
SSTable Compression Ratio: 0.1840953951763564
Number of keys (estimate): 15448921
Memtable cell count: 8558
Memtable data size: 5.89 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 5
Local read count: 67792
Local read latency: 92.314 ms
Local write count: 31336
Local write latency: 0.067 ms
Pending flushes: 0
Percent repaired: 21.18
Bloom filter false positives: 1
Bloom filter false ratio: 0.00794
Bloom filter space used: 22.2 MiB
Bloom filter off heap memory used: 18.45 MiB
Index summary off heap memory used: 3.24 MiB
Compression metadata off heap memory used: 5.08 MiB
Compacted partition minimum bytes: 87
Compacted partition maximum bytes: 943127
Compacted partition mean bytes: 3058
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 4.13 KiB
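To compare these figures across nodes, it can help to turn the tablestats text into something machine-readable. The sketch below is one possible way to do that in Python, assuming the output consists of "Key: value" lines as shown above; the function names parse_tablestats and latency_ms are made up for this example and are not part of nodetool.

```python
def parse_tablestats(text):
    """Parse 'Key: value' lines from nodetool tablestats into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only; lines without a colon are skipped.
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def latency_ms(value):
    """Convert a value such as '92.314 ms' to a float number of milliseconds."""
    number, unit = value.split()
    if unit != "ms":
        raise ValueError(f"unexpected unit: {unit}")
    return float(number)


# A few lines copied from the tablestats output above.
sample = """\
Local read count: 67792
Local read latency: 92.314 ms
Local write count: 31336
Local write latency: 0.067 ms
Dropped Mutations: 4.13 KiB
"""

stats = parse_tablestats(sample)
read_ms = latency_ms(stats["Local read latency"])
write_ms = latency_ms(stats["Local write latency"])
print(f"read latency is {read_ms / write_ms:.0f}x write latency")
```

Running the same parser against the output from a healthy node would show whether the roughly 92 ms local read latency is specific to this node.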
nodetool info shows:
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 280.43 GiB
Generation No : 1514537104
Uptime (seconds) : 8810363
Heap Memory (MB) : 1252.06 / 3970.00
Off Heap Memory (MB) : 573.33
Data Center : dc1
Rack : rack1
Exceptions : 18987
Key Cache : entries 351612, size 99.86 MiB, capacity 100 MiB, 11144584 hits, 21126425 requests, 0.528 recent hit rate, 14400 save period in seconds
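Out of the 5 nodes, one particular node has high dropped mutations ("around 560 KB") and dropped reads, even though that node has the same configuration as the other nodes and holds the same amount of data.
We tried repairing the node, but that did not reduce the dropped mutations, and requests kept failing.
We restarted the Cassandra service on that node, but the dropped mutations are still increasing.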
System.logs
ERROR [ReadRepairStage:10229] 2018-04-11 16:02:12,954 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10229,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
ERROR [ReadRepairStage:10231] 2018-04-11 16:02:17,551 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10231,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
ERROR [ReadRepairStage:10232] 2018-04-11 16:02:22,221 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:10232,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
Debug.Logs
DEBUG [ReadRepairStage:161301] 2018-04-11 01:45:01,432 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
ERROR [ReadRepairStage:161301] 2018-04-11 01:45:01,432 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161301,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [ReadRepairStage:161304] 2018-04-11 01:45:02,692 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-4042387324575455696, 229229902e5a43588d52466b8063b557) (d41d8cd98f00b204e9800998ecf8427e vs 4662dce3dcb05114ed670fbc40291d53)
at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
DEBUG [GossipStage:1] 2018-04-11 01:45:02,958 FailureDetector.java:457 - Ignoring interval time of 2000158817 for /xx.xxx.xx.xxx
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-04-11 01:45:04,665 NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with average duration of 180655.05ms, 1 have exceeded the configured commit interval by an average of 170655.05ms
DEBUG [ReadRepairStage:161303] 2018-04-11 01:45:04,693 DataResolver.java:169 - Timeout while read-repairing after receiving all 1 data and digest responses
ERROR [ReadRepairStage:161303] 2018-04-11 01:45:04,709 CassandraDaemon.java:229 - Exception in thread Thread[ReadRepairStage:161303,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.AsyncRepairCallback.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_144]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,353 MessagingService.java:1214 - MUTATION messages were dropped in last 5000 ms: 87 internal and 77 cross node. Mean internal dropped latency: 89509 ms and Mean cross-node dropped latency: 95871 ms
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - HINT messages were dropped in last 5000 ms: 0 internal and 93 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 86440 ms
INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - READ_REPAIR messages were dropped in last 5000 ms: 0 internal and 72 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 73159 ms
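The three "messages were dropped" lines above carry the most direct numbers about the problem, so it may be useful to extract them from the log and track them over time. Below is a minimal Python sketch that does this with a regular expression written against the exact line format shown above; the pattern and the helper name parse_dropped are assumptions made for this example and may need adjusting for other Cassandra versions.

```python
import re

# Pattern written against the MessagingService log lines shown above.
DROPPED_RE = re.compile(
    r"(?P<verb>[A-Z_]+) messages were dropped in last (?P<window>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross>\d+) cross node\. "
    r"Mean internal dropped latency: (?P<int_lat>\d+) ms and "
    r"Mean cross-node dropped latency: (?P<cross_lat>\d+) ms"
)


def parse_dropped(lines):
    """Yield one dict per 'messages were dropped' log line; other lines are ignored."""
    for line in lines:
        match = DROPPED_RE.search(line)
        if match:
            # Keep the verb as text and convert every numeric field to int.
            yield {
                key: (value if key == "verb" else int(value))
                for key, value in match.groupdict().items()
            }


# The three log lines from debug.log above.
log = [
    "INFO [ScheduledTasks:1] 2018-04-11 01:45:07,353 MessagingService.java:1214 - MUTATION messages were dropped in last 5000 ms: 87 internal and 77 cross node. Mean internal dropped latency: 89509 ms and Mean cross-node dropped latency: 95871 ms",
    "INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - HINT messages were dropped in last 5000 ms: 0 internal and 93 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 86440 ms",
    "INFO [ScheduledTasks:1] 2018-04-11 01:45:07,354 MessagingService.java:1214 - READ_REPAIR messages were dropped in last 5000 ms: 0 internal and 72 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 73159 ms",
]

rows = list(parse_dropped(log))
for row in rows:
    dropped = row["internal"] + row["cross"]
    print(f"{row['verb']}: {dropped} dropped, mean cross-node latency {row['cross_lat']} ms")
```

In these lines the mean dropped latencies are in the range of 73 to 96 seconds, which means the messages sat in a queue on this node for over a minute before being discarded.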
Hoping someone can help me with this issue.
Update:
nodetool info after increasing the heap size of this node to 9 GB:
ID : 68c7c130-6cf8-4864-bde8-1819f238045c
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 279.32 GiB
Generation No : 1523504294
Uptime (seconds) : 9918
Heap Memory (MB) : 5856.73 / 9136.00
Off Heap Memory (MB) : 569.67
Data Center : dc1
Rack : rack2
Exceptions : 862
Key Cache : entries 3650, size 294.83 KiB, capacity 100 MiB, 8112 hits, 22015 requests, 0.368 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 7680, size 480 MiB, capacity 480 MiB, 1282773 misses, 1292444 requests, 0.007 recent hit rate, 3797.874 microseconds miss latency
Percent Repaired : 6.190888093280888%
Token : (invoke with -T/--tokens to see all 256 tokens)
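1) You're using 3.10; you should strongly consider 3.11.2. A lot of critical bugs are fixed in 3.11.2.
2) If you have one node misbehaving, and RF=3, then it's likely you're treating that one node differently from the others. It may be that your application is only connecting to one host and the coordination cost is overwhelming it, or you may have a disproportionate amount of data on it due to some misconfiguration (it looks like you have RF=3 and 2 racks, so it's certainly possible that the data is not distributed exactly as you expect).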
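We ran into this problem ourselves and solved it (as a last resort) by removing the node from the cluster (we believe there was some unknown hardware fault or memory leak on it).
We recommend removing the node with nodetool removenode rather than nodetool decommission, because you do not want to stream data from the faulty node but from one of the replicas.
(This is a safety measure to avoid the possibility of streaming corrupted data to the other nodes.)
After we removed the node, the cluster health returned to normal and it has been running fine.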
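For reference, nodetool removenode takes the host ID of the node to remove (the Host ID column of nodetool status) and is run from another live node while the target node is down, whereas nodetool decommission is run on the node that is leaving. The Python sketch below only builds the command line and does not execute anything; the helper name removenode_command and the all-zero host ID are placeholders for illustration.

```python
import uuid


def removenode_command(host_id):
    """Return the argv for 'nodetool removenode <host-id>' after validating the ID."""
    # uuid.UUID raises ValueError if host_id is not a well-formed UUID.
    normalized = str(uuid.UUID(host_id))
    return ["nodetool", "removenode", normalized]


# Placeholder host ID for illustration; substitute the ID of the node to remove.
cmd = removenode_command("00000000-0000-0000-0000-000000000000")
print(" ".join(cmd))
```

Validating the ID first is just a guard against pasting an IP address or a truncated value; the printed command would then be run by hand on a healthy node once the faulty node has been stopped.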