Determine the optimized number of forests for a MarkLogic content database

MarkLogic content databases are usually large and, in a MarkLogic cluster, are configured to be distributed across multiple nodes (hosts).

How many forests should each host have? Looking at the diagrams in the official clustering guide, sometimes one host has two or more forests, and sometimes each host has only one. Is this number related to the number of CPU cores on the host?

How do you determine the optimal number of forests for a content database?

There are no hard-and-fast rules about forest size and count, because many factors related to the content, usage patterns, and SLAs can affect the answer.

There are, however, some general guidelines: https://docs.marklogic.com/guide/cluster/scalability#id_96443

As your content grows in size, you might need to add forests to your database. There is no limit to the number of forests in a database, but there are some guidelines for individual forest sizes where, if the guidelines are greatly exceeded, then you might see performance degradation.

The numbers in these guidelines are not exact, and they can vary considerably based on the content. Rather, they are approximate, rule-of-thumb sizes. These numbers are based on average sized fragments of 10k to 100k. If your fragments are much larger on average, or if you have a lot of large binary documents, then the forests can probably be larger before running into any performance degradation.

The rule-of-thumb maximum size for a forest is 512GB. Each forest should ideally have two vCPUs of processing power available on its host, with 8GB memory per vCPU. For example, a host with eight vCPUs and 64GB memory can manage four 512GB forests. For bare-metal systems, a hardware thread (hyperthread) is equivalent to a vCPU. It is a good idea to run performance tests with your own workload and content. If you have many configured indexes you may need more memory. Memory requirements may also increase over time as projects evolve and forests grow with more content and more indexes.
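
To make the arithmetic in the quoted guideline concrete, here is a minimal Python sketch that applies the 512GB-per-forest, two-vCPUs-per-forest, and 8GB-per-vCPU rules of thumb. The function names and example inputs are illustrative only; treat it as a starting point for estimation, not an official sizing tool.

```python
# A minimal sketch of the rule-of-thumb arithmetic quoted above:
# 512 GB max per forest, 2 vCPUs per forest, 8 GB RAM per vCPU.
import math

MAX_FOREST_GB = 512      # rule-of-thumb maximum forest size
VCPUS_PER_FOREST = 2     # ideal vCPUs available per forest
GB_RAM_PER_VCPU = 8      # suggested memory per vCPU


def forests_needed(content_gb: float) -> int:
    """Minimum forest count so that no forest exceeds the 512 GB guideline."""
    return max(1, math.ceil(content_gb / MAX_FOREST_GB))


def forests_per_host(vcpus: int, ram_gb: int) -> int:
    """How many forests one host can comfortably manage per the guidelines."""
    by_cpu = vcpus // VCPUS_PER_FOREST
    by_ram = ram_gb // (VCPUS_PER_FOREST * GB_RAM_PER_VCPU)
    return max(1, min(by_cpu, by_ram))


if __name__ == "__main__":
    # The example from the quoted guideline: 8 vCPUs and 64 GB RAM -> 4 forests.
    print(forests_per_host(vcpus=8, ram_gb=64))   # 4
    # 3 TB of content needs at least 6 forests across the cluster.
    print(forests_needed(content_gb=3072))        # 6
```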

If your content database is small, creating a forest on every host may not make sense. Until you reach a certain size, or unless you need to hit certain performance benchmarks, a single forest may be fine. If you expect the system to grow significantly, or it has high-throughput demands, then spreading the load across multiple D-nodes can help.
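
If you do decide to spread the content database across several D-nodes, one way to script the forest creation is through the Management REST API (by default on port 8002). The sketch below is only a hedged example: the host names, credentials, and database name are placeholders, and you should verify the forest property names against the Management API documentation for your MarkLogic version.

```python
# A hedged sketch of creating one forest per D-node and attaching it to the
# content database through the MarkLogic Management REST API (default port 8002).
import requests
from requests.auth import HTTPDigestAuth

ADMIN = HTTPDigestAuth("admin", "admin-password")   # placeholder credentials
MANAGE = "http://ml-node1.example.com:8002"         # any host in the cluster
D_NODES = ["ml-node1.example.com", "ml-node2.example.com", "ml-node3.example.com"]

for i, host in enumerate(D_NODES, start=1):
    payload = {
        "forest-name": f"content-{i:02d}",
        "host": host,                 # place the forest on this D-node
        "database": "my-content-db",  # attach it to the content database
    }
    resp = requests.post(f"{MANAGE}/manage/v2/forests", json=payload, auth=ADMIN)
    resp.raise_for_status()
    print(f"created {payload['forest-name']} on {host}")
```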

Another factor that can affect the number of forests is whether you have HA replicas and want to ensure that, if a failover occurs during peak load, the failover host has enough headroom when it opens the HA replica forests. Having more forests per host and striping the HA replicas, so that when a host fails its load is split across two failover hosts that each take half of it, can provide better resiliency and smoother performance. But that depends heavily on the type of workload and how hard those D-nodes are being pushed. A sketch of the striping idea follows.
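
As a rough illustration of that striping idea, the sketch below assigns each primary forest's replica to a different neighbouring host, so a failed host's forests reopen on two hosts instead of one. The host names and the two-forests-per-host count are made up for the example; the actual replica placement is something you configure yourself when creating the replica forests.

```python
# A sketch of "striping" HA replica forests: with two primary forests per
# host, send each forest's replica to a different neighbouring host, so a
# failed host's load reopens on two hosts rather than one.
HOSTS = ["ml-node1", "ml-node2", "ml-node3"]
FORESTS_PER_HOST = 2


def striped_replica_layout(hosts, forests_per_host):
    """Map each primary forest to a replica host, alternating neighbours."""
    layout = {}
    n = len(hosts)
    for h, host in enumerate(hosts):
        for f in range(forests_per_host):
            primary = f"{host}-forest-{f + 1}"
            # Alternate between the next host and the one after it, so that
            # if `host` fails its forests reopen on two different hosts.
            offset = 1 + (f % (n - 1))
            layout[primary] = hosts[(h + offset) % n]
    return layout


if __name__ == "__main__":
    for primary, replica_host in striped_replica_layout(HOSTS, FORESTS_PER_HOST).items():
        print(f"{primary:20s} -> replica on {replica_host}")
```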

If you can run performance tests with realistic data and usage patterns, monitor resource consumption, and have SLAs for response times, it becomes much easier to experiment and determine where the thresholds are and when changing the number of forests or adding D-nodes will help.
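
If you go down that route, it can help to watch forest status while the load test runs. The snippet below is a rough sketch that polls each forest's status document through the Management REST API; the endpoint and port are MarkLogic defaults, but the exact fields in the response vary by version, so it simply prints the returned JSON for inspection. Hosts, credentials, and forest names are placeholders.

```python
# A rough sketch of polling forest status through the Management REST API
# while a load test runs.
import json
import time

import requests
from requests.auth import HTTPDigestAuth

ADMIN = HTTPDigestAuth("admin", "admin-password")   # placeholder credentials
MANAGE = "http://ml-node1.example.com:8002"         # any host in the cluster
FORESTS = ["content-01", "content-02"]              # forests to watch

for _ in range(60):                                  # poll for about an hour
    for forest in FORESTS:
        resp = requests.get(
            f"{MANAGE}/manage/v2/forests/{forest}",
            params={"view": "status", "format": "json"},
            auth=ADMIN,
        )
        resp.raise_for_status()
        # Print a truncated dump; pick out the size/usage fields you care
        # about once you have seen what your version returns.
        print(forest, json.dumps(resp.json())[:300])
    time.sleep(60)
```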