Snowflake: 2 correlated columns have very different clustering information (one perfect, the other terrible)

Question

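We have a table with 120M rows (spread over 2222 micro-partitions) that has 2 important columns: record_id, whose values have the format prefix|<account_id>|<uuid> (unique), and account_id, whose value is <account_id>. Note that the prefix is the same for all records. There are of course some fact columns as well, but they are not relevant here.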

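Snowflake's clustering_information function reports perfect clustering for the record_id column (chosen automatically by Snowflake; we have not defined a clustering key ourselves):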

"total_partition_count" : 2222,
 "total_constant_partition_count" : 2222,
 "average_overlaps" : 24.0,
 "average_depth" : 25.0,

For the account_id column, however, the clustering is terrible:

 "total_constant_partition_count" : 0,
 "average_overlaps" : 2221.0,
 "average_depth" : 2222.0,

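There are about 130 distinct account IDs, which means that on average the records of one account_id should span about 17 partitions (2222 / 130). Even if Snowflake clusters by record_id, the beginning of that column (prefix|<account_id>) is correlated with the account_id column, so records with the same account_id should end up in the same partitions. I therefore cannot figure out why the micro-partitions overlap 100% on the account_id column. It is as if Snowflake used some odd ordering on the record_id column and thereby scattered the rows of every account across all partitions. Is that possible?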

This hurts performance, because a query with a filter on account_id ends up scanning all partitions.
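The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: it models each micro-partition as a fixed-size chunk of rows that keeps a min/max range for account_id, and it treats a partition as "must be scanned" when the filter value falls inside that range. The account names, the partition size of 100 rows, and the helper `partitions_to_scan` are made up for illustration; only the figures 2222 and 130 come from the question.

```python
import random

ROWS_PER_PARTITION = 100   # assumption: tiny partitions to keep the demo fast
TOTAL_PARTITIONS = 2222    # from the question
DISTINCT_ACCOUNTS = 130    # from the question


def partitions_to_scan(rows, wanted):
    """Count partitions whose [min, max] range could contain `wanted`."""
    scanned = 0
    for start in range(0, len(rows), ROWS_PER_PARTITION):
        chunk = rows[start:start + ROWS_PER_PARTITION]
        if min(chunk) <= wanted <= max(chunk):
            scanned += 1
    return scanned


random.seed(0)
# Hypothetical account ids; rows arrive in random account order.
accounts = [f"acct{n:03d}" for n in range(DISTINCT_ACCOUNTS)]
rows = [random.choice(accounts) for _ in range(TOTAL_PARTITIONS * ROWS_PER_PARTITION)]

scattered = partitions_to_scan(rows, "acct065")           # rows in arrival order
sorted_scan = partitions_to_scan(sorted(rows), "acct065")  # rows ordered by account

# Expected partitions per account if the data were ordered: about 17.09
print(TOTAL_PARTITIONS / DISTINCT_ACCOUNTS)
print(scattered, sorted_scan)
```

In this model the scattered layout forces a scan of every partition, while the layout ordered by account touches only a number of partitions close to the 17 estimated in the question.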

Note: I also asked this question on the Snowflake forum: https://support.snowflake.net/s/question/0D50Z00008vfglCSAQ/2-correlated-columns-have-very-different-clustering-information-one-has-perfect-the-other-has-terrible

Answer 1

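Snowflake's clustering reporting functions, such as the one used above, have a limitation: only the first 6 characters of a varchar are considered when evaluating clustering depth. So I would not trust the good result reported for record_id, because the first 6 characters are probably identical due to the prefix, even if the account_id that follows is random.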
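A rough way to see why truncation could produce this report is sketched below. It assumes, as this answer states, that only the first 6 characters of the varchar are compared, and it uses a hypothetical 6-character prefix ("prefix"), made-up account ids, and small 100-row partitions. A partition is counted as "constant" when its min and max are equal, which is a simplification of what the real report computes.

```python
import random
import uuid

PREFIX = "prefix"   # hypothetical; the real prefix is not given in the question
TRUNCATE = 6        # characters considered, per the limitation described above
ROWS_PER_PARTITION = 100


def partition_ranges(values, key=lambda v: v):
    """Return the (min, max) of key(value) for each fixed-size partition."""
    ranges = []
    for start in range(0, len(values), ROWS_PER_PARTITION):
        chunk = [key(v) for v in values[start:start + ROWS_PER_PARTITION]]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def constant_partitions(ranges):
    """Count partitions whose min and max are identical."""
    return sum(1 for lo, hi in ranges if lo == hi)


random.seed(0)
accounts = [f"acct{n:03d}" for n in range(130)]
account_ids = [random.choice(accounts) for _ in range(20000)]  # random order
record_ids = [f"{PREFIX}|{a}|{uuid.uuid4()}" for a in account_ids]

truncated = partition_ranges(record_ids, key=lambda v: v[:TRUNCATE])
full = partition_ranges(record_ids)
by_account = partition_ranges(account_ids)

# With truncation, every partition looks constant on record_id,
# even though account_id is scattered across all of them.
print(constant_partitions(truncated), "of", len(truncated))
print(constant_partitions(full), constant_partitions(by_account))
```

Under these assumptions the truncated record_id reports every partition as constant, matching the pattern in the question, while neither the full record_id nor account_id has a single constant partition.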

The best solution is to explicitly declare a clustering key on account_id and enable automatic clustering on the table.

clustered-index

snowflake-cloud-data-platform