Using "match_recognize" in a Common Table Expression in Snowflake

Update: answered

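I'm putting together a somewhat complex query to do event detection, joining, and time-based binning on a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me detect time-series events elegantly, but whenever I try to use a match_recognize expression inside a Common Table Expression (with .. as ..), I get the following error: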

SQL compilation error: MATCH_RECOGNIZE not supported in this context.

I have done a lot of searching and reading, but I haven't found any documented limitation on match_recognize in CTEs. Here is my query:

with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
),

label_events as (
    select *
    from clean_data
        match_recognize (
            partition by id
            order by timestamp
            measures
                match_number() as event_number
            all rows per match
            after match skip past last row
            pattern(any_row row_between_gaps+)
            define
                -- Classify contiguous sections of datapoints with < 20min between adjacent points.
                row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
        )
)

-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;

Running this gives the same error as above.

Is this a limitation I have overlooked, or am I doing something wrong?

A non-recursive CTE can always be rewritten as an inline view:

--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
     from dataset) clean_data
match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            -- Classify contiguous sections of datapoints with < 20min between adjacent points.
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    ) mr
-- ) -- if other transformations are required

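It is not ideal, but at least it lets the query run.

According to the thread linked in Filipe Hoffa's comment:

This appears to have been an undocumented limitation of Snowflake at the time. A two- or three-step solution has worked well for me: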

with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
)

select *
from clean_data
    match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            -- Classify contiguous sections of datapoints with < 20min between adjacent points.
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    );

set quid=last_query_id();

with label_events as (
    select *
    from table(result_scan($quid))
)

-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;

I prefer to use a variable here because I can re-run the second query multiple times during development and debugging without having to re-run the first one.
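
For illustration, the binning placeholder in the second query could be filled in along these lines. This is only a sketch: it assumes $quid still holds the id of the match_recognize query, and the 0 to 100 range and the 10 buckets are made-up values that would need to match the real measurement range.

```sql
-- Sketch only: the range (0, 100) and the bucket count (10) are assumed values,
-- not taken from the original query.
select
    id,
    event_number,
    width_bucket(measurement, 0, 100, 10) as measurement_bin,
    count(*) as points_in_bin
from table(result_scan($quid))
group by id, event_number, measurement_bin
order by id, event_number, measurement_bin;
```

Because this reads from result_scan, it can be re-run with different bucket settings without repeating the match_recognize step.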

It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GeoJSON, so when you retrieve them with result_scan you must convert them back to the GEOGRAPHY type.
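
A minimal sketch of that conversion, assuming the first query had also returned a GEOGRAPHY column. The column name location is hypothetical (the query above has no such column), and this relies on to_geography accepting the GeoJSON value that result_scan hands back:

```sql
-- `location` is a hypothetical column used for illustration only.
select
    id,
    to_geography(location) as location
from table(result_scan($quid));
```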