Measuring the TPC-DS benchmark on Citus

Question

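I am trying to take some measurements on Citus (an extension to Postgres). For that task I am running TPC-DS queries on Citus. The Citus setup I am using consists of the master, worker, and manager containers from here: https://github.com/citusdata/docker. I can add workers by adding more worker containers. So far so good, but I am having trouble with the measurements and need some answers: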

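To use all workers I need to run create_distributed_table/create_reference_table. Does that copy all the data to all workers (for example, 1 TB of data becomes 16 TB for 16 workers)?
If I do not use create_distributed_table but add workers, is there any benefit to that?
If I have already run create_distributed_table and add workers later, does the data get distributed to them, or do I need to run create_distributed_table again?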

Answer 1

To use all workers I need to run create_distributed_table/create_reference_table. Does that copy all the data to all workers (for example, 1 TB of data becomes 16 TB for 16 workers)?

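Reference tables are replicated to every node in the cluster, while distributed tables are sharded across the worker nodes.

Suppose you run the following queries on a Citus cluster with 16 workers, where each table holds 16 GB of data: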

SELECT create_reference_table('ref_table');
SELECT create_distributed_table('dist_table','partition_column_name');

Then each of your worker nodes will hold about 1 GB of dist_table data (16 GB split across 16 workers) and the full 16 GB of ref_table data.
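To check this split on a running cluster, a query along the following lines can be run on the coordinator. It is only a sketch: it assumes Citus 10 or later, where the citus_shards view is available, and it reuses the dist_table and ref_table names from the example above.

```sql
-- Sketch, assuming Citus 10+ (citus_shards view) and the example table names above.
-- Shows how many shards of each table sit on each node and how much space they use.
SELECT table_name,
       nodename,
       nodeport,
       count(*)                        AS shard_count,
       pg_size_pretty(sum(shard_size)) AS size_on_node
FROM citus_shards
WHERE table_name IN ('dist_table'::regclass, 'ref_table'::regclass)
GROUP BY table_name, nodename, nodeport
ORDER BY table_name, nodename, nodeport;
```

For dist_table the per-node sizes should add up to roughly the table size, while ref_table should show roughly the full table size on every node that holds a copy.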

If I do not use create_distributed_table but add workers, is there any benefit to that?

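If you do not run a rebalance operation, or manually move shards to the new nodes, adding new nodes generally will not help you. The new nodes will hold all the distributed objects in the cluster (users, functions, schemas, types, etc.) and replicas of the reference tables. The only queries that will hit these new worker nodes are the ones that access only reference tables.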
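A rebalance after adding a worker might look like the sketch below, run on the coordinator. It assumes Citus 10 or later, where citus_add_node and rebalance_table_shards are available; the hostname worker-17 and port 5432 are hypothetical. In the Docker setup from the question the manager container may already register new workers, in which case the first statement is unnecessary.

```sql
-- Sketch, assuming Citus 10+; 'worker-17' and 5432 are hypothetical values.
-- Register the new worker with the coordinator (skip if it is already registered).
SELECT citus_add_node('worker-17', 5432);

-- Optional: preview which shards would be moved, without moving anything.
SELECT * FROM get_rebalance_table_shards_plan();

-- Move shards so that they are spread evenly across all workers.
SELECT rebalance_table_shards();
```

By default shards are moved using logical replication, which needs a primary key or replica identity on the tables. If the benchmark tables have none, passing shard_transfer_mode => 'block_writes' to rebalance_table_shards is one alternative, at the cost of blocking writes to the shards being moved.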

If I have already run create_distributed_table and add workers later, does the data get distributed to them, or do I need to run create_distributed_table again?

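If you run SELECT create_distributed_table('events','id'), the shards are created on the worker nodes that exist at that time. If you add new nodes later, you will not see any shards of the events table on them unless you rebalance.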

However, if you run SELECT create_reference_table('customers'), you will see a copy of all the data in customers on every node of the cluster.

postgresql

citus