Getting batch-read timeouts in Aerospike: batch_queue pile-up

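Background:

I am using an Aerospike cluster with 9 nodes. The cluster seems to be working fine, but some batch reads time out intermittently. Interestingly, the timeouts occur on the server side itself, and only on 2 of the 9 nodes. I suspected key hotspotting was the problem here, but that does not seem to be the case.

On checking the server stats, what stood out was a correlation between the batch_queue size and the timeouts.

Command: asadm -e "watch 1 100 show stat like batch"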

[ 2017-09-07 20:56:10 'show stat like batch' sleep: 1.0s iteration: 47 of 100 ]

batch_queue : 586
batch_timeout : 81709

[ 2017-09-07 20:56:11 'show stat like batch' sleep: 1.0s iteration: 48 of 100 ]

batch_queue : 545
batch_timeout : 84357

[ 2017-09-07 20:56:12 'show stat like batch' sleep: 1.0s iteration: 49 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:13 'show stat like batch' sleep: 1.0s iteration: 50 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:14 'show stat like batch' sleep: 1.0s iteration: 51 of 100 ]

batch_queue : 0
batch_timeout : 88544

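There seems to be a clear correlation between batch_queue piling up and requests timing out: batch_timeout keeps climbing while batch_queue is non-zero, and stops climbing once the queue drains to 0.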
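To make that correlation easier to see, the cumulative batch_timeout counter can be turned into a per-sample delta and placed next to the queue depth. The following is a minimal sketch, assuming the watch output has been saved as text in the format shown above; the helper names `parse_samples` and `timeout_deltas` are made up for illustration and are not part of asadm.

```python
import re

# Text copied from the asadm watch output shown above.
WATCH_OUTPUT = """
[ 2017-09-07 20:56:10 'show stat like batch' sleep: 1.0s iteration: 47 of 100 ]

batch_queue : 586
batch_timeout : 81709

[ 2017-09-07 20:56:11 'show stat like batch' sleep: 1.0s iteration: 48 of 100 ]

batch_queue : 545
batch_timeout : 84357

[ 2017-09-07 20:56:12 'show stat like batch' sleep: 1.0s iteration: 49 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:13 'show stat like batch' sleep: 1.0s iteration: 50 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:14 'show stat like batch' sleep: 1.0s iteration: 51 of 100 ]

batch_queue : 0
batch_timeout : 88544
"""

# A sample header starts with "[ <date> <time> "; stats are "name : value".
HEADER = re.compile(r"^\[ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")
STAT = re.compile(r"^(batch_queue|batch_timeout)\s*:\s*(\d+)$")


def parse_samples(text):
    """Return one dict per watch iteration: time, batch_queue, batch_timeout."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            samples.append({"time": header.group(1)})
            continue
        stat = STAT.match(line)
        if stat and samples:
            samples[-1][stat.group(1)] = int(stat.group(2))
    return samples


def timeout_deltas(samples):
    """Pair the queue depth at the start of each interval with the number
    of new timeouts counted by the end of that interval."""
    rows = []
    for prev, cur in zip(samples, samples[1:]):
        new_timeouts = cur["batch_timeout"] - prev["batch_timeout"]
        rows.append((cur["time"], prev["batch_queue"], new_timeouts))
    return rows


if __name__ == "__main__":
    for time, queue_before, new_timeouts in timeout_deltas(parse_samples(WATCH_OUTPUT)):
        print(f"{time}  queue_before={queue_before:4d}  new_timeouts={new_timeouts}")
```

On the samples above, the two intervals that start with a non-empty queue account for all new timeouts, and the intervals that start with an empty queue add none.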

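Question

What exactly is this batch queue, and why does it pile up on only a few of the Aerospike nodes?
How do I go about fixing it?

Thanks

Edit:

http://www.aerospike.com/docs/guide/batch.html answers the first question quite well.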

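I would recommend moving to batch-index if possible (this depends on the client). Timeouts on only certain nodes could indicate a few different things:

Some nodes get more records per batch than the others
Some difference in those nodes (CPU, kernel version, storage, configuration) makes them slower
Other activity on those nodes slows them down (hot keys from other read/write transactions)

Basically, anything that slows those nodes down will cause the batch queue to pile up and some batch transactions to time out.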
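The mechanism can be illustrated with a toy queue model. This is only a sketch under made-up numbers and is not how the Aerospike server implements batch handling: the function `simulate_queue`, the tick-based timing, and all parameter values are assumptions chosen for illustration. It shows that a node whose service capacity falls below its arrival rate builds up a backlog and starts timing requests out, while an otherwise identical node with spare capacity does neither.

```python
from collections import deque


def simulate_queue(arrivals_per_tick, capacity_per_tick, max_wait_ticks, ticks):
    """Toy FIFO queue for one node.

    Each tick: new requests arrive, requests that have waited longer than
    max_wait_ticks are dropped as timed out, then up to capacity_per_tick
    requests are served. Returns (final_queue_depth, timed_out).
    """
    queue = deque()  # stores the tick at which each waiting request arrived
    timed_out = 0
    for now in range(ticks):
        # New batch sub-requests arrive at this node.
        for _ in range(arrivals_per_tick):
            queue.append(now)
        # Requests at the head that waited too long are counted as timeouts.
        while queue and now - queue[0] > max_wait_ticks:
            queue.popleft()
            timed_out += 1
        # The node serves as many requests as its capacity allows.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
    return len(queue), timed_out


if __name__ == "__main__":
    # Hypothetical numbers: same load, different service capacity.
    healthy = simulate_queue(arrivals_per_tick=10, capacity_per_tick=12,
                             max_wait_ticks=2, ticks=20)
    slow = simulate_queue(arrivals_per_tick=10, capacity_per_tick=7,
                          max_wait_ticks=2, ticks=20)
    print("healthy node (queue, timeouts):", healthy)
    print("slow node    (queue, timeouts):", slow)
```

The same effect appears if capacity is held equal and the arrival rate is raised instead, which corresponds to the case where some nodes receive more records per batch than others.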

Finally, if you have not already done so, you could try increasing batch-threads and batch-priority.

architecture

timeout

key-value-store

aerospike