Snowflake: clustering on a datetime key stored in a variant field does not work / does not prune partitions

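We are ingesting data into Snowflake via the Kafka connector. To improve read performance and scan fewer partitions, we decided to add a clustering key on a key (or combination of keys) stored in the RECORD_CONTENT variant field.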

The data in the RECORD_CONTENT field looks like this:

{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
  }
}

Now, the problem is that clustering on a datetime column such as SourceDateTime does NOT work:

CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))

...whereas clustering on a field such as Integerfield DOES work:

CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int )

By "does not work" I mean: a filter on RECORD_CONTENT:jsonSrc:loadDts::datetime has no effect on the number of partitions scanned, whereas a filter on RECORD_CONTENT:jsonSrc:Integerfield::int does prune partitions.

What is wrong here? Is this a bug?

Note:

For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:

- Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
- Numbers within strings
- Arrays

Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

See this link: https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure
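
The flattening that the quoted note recommends could look roughly like the sketch below. This is not a verified fix for the pruning behavior described above, only one way to move the timestamp out of the VARIANT into a typed column and cluster on that. The table names kafka_events and kafka_events_flat and the column names source_ts and integer_field are hypothetical. The format string is an assumption inferred from the sample value "2020-06-30 05:33:08:345", which uses a colon before the milliseconds and is therefore not ISO 8601.

```sql
-- Sketch only: kafka_events, kafka_events_flat, source_ts and integer_field
-- are hypothetical names, not taken from the original post.
CREATE TABLE kafka_events_flat
  CLUSTER BY (TO_DATE(source_ts))
AS
SELECT
    -- Parse the string with an explicit format (assumed from the sample value).
    -- TRY_TO_TIMESTAMP_NTZ returns NULL instead of failing on unparsable input.
    TRY_TO_TIMESTAMP_NTZ(
        RECORD_CONTENT:jsonSrc:SourceDateTime::STRING,
        'YYYY-MM-DD HH24:MI:SS:FF3'
    ) AS source_ts,
    RECORD_CONTENT:jsonSrc:Integerfield::INT AS integer_field,
    RECORD_CONTENT
FROM kafka_events;

-- Filter on the typed column rather than on the VARIANT path.
SELECT COUNT(*)
FROM kafka_events_flat
WHERE source_ts >= '2020-06-30'
  AND source_ts <  '2020-07-01';
```

After running the filtered query, the query profile's "partitions scanned" versus "partitions total" figures should show whether pruning actually takes place on the typed column.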