Snowflake: clustering on a datetime key stored in a VARIANT field does not work / does not prune partitions

Question

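We are ingesting data into Snowflake via the Kafka connector. To improve read performance and scan fewer partitions, we decided to add a clustering key on a key (or combination of keys) stored in the RECORD_CONTENT variant field.

The data in the RECORD_CONTENT field looks like this: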

{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
  }
}

Now, the problem is that clustering on a datetime field like SourceDateTime does NOT work:

CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))

...while clustering on a field like Integerfield DOES work:

CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int )

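"Not working" means: filtering on RECORD_CONTENT:jsonSrc:loadDts::datetime has no effect on the number of partitions scanned, while filtering on RECORD_CONTENT:jsonSrc:Integerfield::int does perform partition pruning.

What is wrong here? Is this a bug?

Notes:

There is enough data in RECORD_CONTENT:jsonSrc:loadDts::datetime.
I verified that clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime works by making a copy of the original table with RECORD_CONTENT:jsonSrc:loadDts::datetime extracted into a separate column, loadDtsCol, and then adding a similar clustering key on that column: to_date(loadDtsCol).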

Answer 1

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values

Numbers within strings

Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

See this link: https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure
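
A minimal sketch of that flattening approach, mirroring the workaround described in the question's notes. The table names SRC_TABLE and SRC_TABLE_FLAT are hypothetical placeholders. The column name loadDtsCol, the cast, and the clustering expression are taken from the question. Treat it as an illustration under those assumptions rather than a definitive setup.

```sql
-- SRC_TABLE and SRC_TABLE_FLAT are hypothetical names; adjust to your schema.
-- Copy the Kafka-loaded table, extracting the timestamp from the VARIANT
-- into a typed relational column, and cluster on the date of that column.
CREATE OR REPLACE TABLE SRC_TABLE_FLAT
  CLUSTER BY (TO_DATE(loadDtsCol))
AS
SELECT
  RECORD_CONTENT,
  RECORD_CONTENT:jsonSrc:loadDts::datetime AS loadDtsCol
FROM SRC_TABLE;

-- Filter on the typed column instead of the VARIANT path expression.
SELECT COUNT(*)
FROM SRC_TABLE_FLAT
WHERE loadDtsCol >= '2020-06-30'::datetime
  AND loadDtsCol <  '2020-07-01'::datetime;

-- Optional: inspect how well the table is clustered on that expression.
SELECT SYSTEM$CLUSTERING_INFORMATION('SRC_TABLE_FLAT', '(TO_DATE(loadDtsCol))');
```

To check whether pruning actually occurred, compare the partitions scanned against the partitions total in the query profile of the SELECT.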

database-performance

clustering-key

snowflake-cloud-data-platform