Cluster key selection on Snowflake
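(Submitted on behalf of a Snowflake user)
Question:
Why is a filter or search key (a key used in the WHERE clause) a better choice for a clustering key than an ORDER BY or GROUP BY key?
One resource recommends reading: https://support.snowflake.net/s/article/case-study-how-clustering-can-improve-your-query-performance
Another resource mentions: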
The performance of query filter will be better because the data is sorted it would skip all the rows which are not required.
For the scenario which has query filter on columns which are not part of sort order but the columns in group by and order by are part of data sort order (clustered keys), it may take time to select those data but the sorting would be easy since the data is already in an order.
A third resource states:
The clustering key is important for the WHERE clause when you only select a small portion of the overall data that you have in your tables, because it can reduce the amount of data that has to be read from the Storage into the Compute when the Optimizer can use the clustering key for Query Pruning.
You can alternatively use the clustering key to optimize table inserts and possibly also query output (eg sort order).
Your choice should depend on your priorities, there is no cure all unless a single key covers all above.
The user followed up with this question:
If I always insert the rows in the order in which they will be retrieved, do I still need to create a cluster key? For example if a table is always queried using a date_timestamp and if I ensure that I am inserting in the table order by date_timestamp, do I still need to create a cluster key on date_timestamp?
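Any thoughts, suggestions, etc.? Thanks!
Choose the cluster key based on FILTER/GROUP/SORT. The first "resource" is right.
If the filter will result in pruning, then it is probably the best choice (so that data can be skipped). If all or most of the data must be read, then clustering on a GROUP/SORT key is probably fastest (so that less time is spent re-sorting). These docs state: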
Typically, queries benefit from clustering when the queries filter or
sort on the clustering key for the table. Sorting is commonly done for
ORDER BY operations, for GROUP BY operations, and for some joins.
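To make the pruning argument concrete, here is a toy Python sketch. It is only a simplified model, not Snowflake's actual storage engine: it assumes each partition keeps a min/max range per column and that a filter can skip any partition whose range cannot contain the value. The names `build_partitions` and `partitions_scanned`, the columns `event_day` and `customer_id`, and the partition size are all hypothetical.

```python
import random

PARTITION_SIZE = 100  # rows per toy partition (illustrative number only)

def build_partitions(rows, sort_key):
    """Sort rows by sort_key, cut them into fixed-size partitions, and
    record (min, max) metadata for every column of each partition."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    partitions = []
    for start in range(0, len(ordered), PARTITION_SIZE):
        chunk = ordered[start:start + PARTITION_SIZE]
        stats = {col: (min(r[col] for r in chunk), max(r[col] for r in chunk))
                 for col in chunk[0]}
        partitions.append({"rows": chunk, "stats": stats})
    return partitions

def partitions_scanned(partitions, column, value):
    """Count partitions whose [min, max] range for column could hold value.
    All other partitions can be skipped (pruned) without reading them."""
    return sum(1 for p in partitions
               if p["stats"][column][0] <= value <= p["stats"][column][1])

random.seed(0)
rows = [{"event_day": random.randrange(365),
         "customer_id": random.randrange(1000)}
        for _ in range(10_000)]

# Same data, two layouts: ordered by the filter column vs. by another column.
by_filter_col = build_partitions(rows, "event_day")
by_other_col = build_partitions(rows, "customer_id")

# Equivalent of: WHERE event_day = 180
scanned_filter = partitions_scanned(by_filter_col, "event_day", 180)
scanned_other = partitions_scanned(by_other_col, "event_day", 180)

print("ordered by filter column:", scanned_filter, "of", len(by_filter_col))
print("ordered by other column: ", scanned_other, "of", len(by_other_col))
```

In this model, the layout ordered by the filter column touches only a handful of partitions, while the layout ordered by an unrelated column has to read nearly all of them. That is the case where clustering on the WHERE key pays off. When a query must read most of the table anyway, pruning cannot help, and an order that matches the GROUP BY or ORDER BY key may matter more.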
As for the second question, about natural clustering: in that case there is little performance benefit to defining a cluster key.
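The natural clustering case can be illustrated with another toy Python sketch, under the same simplifying assumption that each partition is described by a min/max range. It is not a measurement of Snowflake itself, and `load_in_arrival_order` and `overlapping_pairs` are hypothetical helper names. Rows are cut into partitions in arrival order with no re-sorting, and the sketch counts how many pairs of partitions have overlapping timestamp ranges.

```python
import random

PARTITION_SIZE = 100  # rows per toy partition (illustrative number only)

def load_in_arrival_order(timestamps):
    """Cut rows into partitions in the order they arrive (no re-sorting)
    and return each partition's (min, max) timestamp range."""
    chunks = (timestamps[i:i + PARTITION_SIZE]
              for i in range(0, len(timestamps), PARTITION_SIZE))
    return [(min(chunk), max(chunk)) for chunk in chunks]

def overlapping_pairs(ranges):
    """Count pairs of partitions whose [min, max] ranges overlap."""
    count = 0
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            if ranges[i][0] <= ranges[j][1] and ranges[j][0] <= ranges[i][1]:
                count += 1
    return count

# Rows inserted in date_timestamp order (integers stand in for timestamps).
ordered_arrivals = list(range(2_000))

# The same rows inserted in random order, for comparison.
shuffled_arrivals = ordered_arrivals[:]
random.seed(0)
random.shuffle(shuffled_arrivals)

natural = overlapping_pairs(load_in_arrival_order(ordered_arrivals))
scattered = overlapping_pairs(load_in_arrival_order(shuffled_arrivals))

print("overlapping partition pairs, in-order inserts:", natural)
print("overlapping partition pairs, shuffled inserts:", scattered)
```

With in-order inserts, the partition ranges in this model do not overlap at all, so a filter on the timestamp can already skip most partitions. An explicit cluster key on the same column would have little left to improve.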