Sphinx RT-index updating in parallel
Since I can't find it anywhere: is it possible to update a Sphinx RT-index in parallel?
For example, I've noticed that processing slows down once documents exceed 1,000,000 words. So I'd like to split my processing off into a separate thread for documents with more than 1,000,000 words, so that smaller documents aren't blocked from being processed.
However, I haven't found any benchmarks for updating an RT-index in parallel, nor any documentation on it.
Is anyone else using this approach, or is it considered bad practice?
First let me remind you that when you update something in a Sphinx real-time index (and the same holds for Manticore Search / Lucene / Solr / Elastic), you don't actually update anything in place: you just add the change to a new segment (a RAM chunk in Sphinx's case), which eventually (usually much later) gets merged with the other segments, and only then is the change really applied. So the question is really how fast you can populate the RT RAM chunk with new records, and how concurrency changes the throughput. I've run tests based on https://github.com/Ivinco/stress-tester and here's what I got:
snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.258;3537;100000;99957;0.275;0.202;0.519;1.221
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.811;5313;100000;99957;0.34;0.227;0.673;2.038
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;16.751;5967;100000;99957;0.538;0.326;1.163;3.797
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;20.576;4857;100000;99957;0.739;0.483;1.679;5.527
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;23.55;4244;100000;99957;0.862;0.54;2.102;5.849
That is, increasing the concurrency from 1 to 11 (on an 8-core server in my case) lets you raise the throughput from about 3500 to 4200 documents per second, i.e. roughly 20%. Not bad, but not a huge performance gain either.
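For reference, this is roughly what such a concurrent-insert test boils down to: several writer threads, each with its own SphinxQL connection, sending multi-row INSERT batches into the same RT index. Below is a minimal sketch (not the stress-tester itself); the index name rt, its schema (a content full-text field and a gid uint attribute), the 127.0.0.1:9306 SphinxQL listener and the pymysql client are all assumptions made for the example:

import threading
import pymysql  # SphinxQL speaks the MySQL protocol, so a plain MySQL client works

# Assumptions for this sketch: searchd listening for SphinxQL on 127.0.0.1:9306
# and an RT index "rt" with a full-text field "content" and a uint attribute "gid".
CONCURRENCY = 5
DOCS_PER_THREAD = 10000
BATCH_SIZE = 100          # same as -b=100 in the test runs above

def writer(thread_id):
    conn = pymysql.connect(host="127.0.0.1", port=9306, user="", autocommit=True)
    cur = conn.cursor()
    base_id = thread_id * DOCS_PER_THREAD + 1   # keep document ids unique per thread
    for start in range(0, DOCS_PER_THREAD, BATCH_SIZE):
        rows = ", ".join(
            "(%d, 'test document number %d', %d)" % (base_id + start + i, start + i, thread_id)
            for i in range(BATCH_SIZE)
        )
        # one multi-row INSERT per batch goes into the RT index's RAM chunk
        cur.execute("INSERT INTO rt (id, content, gid) VALUES " + rows)
    conn.close()

threads = [threading.Thread(target=writer, args=(t,)) for t in range(CONCURRENCY)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With a single RT index the scaling stays modest, as in the numbers above, since all the writers funnel into the same index.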
In your case another approach may also work: instead of one index you can write to several, and then have a distributed index that combines them. In other contexts this is called sharding. For example, if you write to two RT indexes instead of one, you can get this:
snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.083;3559;100000;99957;0.274;0.206;0.514;1.223
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.03;5543;100000;99957;0.328;0.225;0.653;1.919
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;15.07;6633;100000;99957;0.475;0.264;1.066;3.821
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;18.608;5371;100000;99957;0.613;0.328;1.479;4.897
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;26.071;3833;100000;99957;0.632;0.294;1.652;4.729
That is, about 6600 documents per second at concurrency 5. That's almost 90% above the initial throughput, which looks like a good result. By playing with the number of indexes and the concurrency you can find the optimal settings for your case.
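For completeness, here is what the two-index ("sharded") setup from the second test can look like in sphinx.conf; the index names, paths and schema below are made up for the example:

index rt_shard1
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard1
    rt_field     = content
    rt_attr_uint = gid
}

index rt_shard2
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard2
    rt_field     = content
    rt_attr_uint = gid
}

# searched as one index; writes still go to rt_shard1 / rt_shard2 directly
index rt_all
{
    type  = distributed
    local = rt_shard1
    local = rt_shard2
}

The client then spreads INSERTs across rt_shard1 and rt_shard2 (for example by document id modulo 2), while SELECTs go to rt_all, which queries both shards and merges the results.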