Using "match_recognize" in a Common Table Expression in Snowflake
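Update: this has been answered.
I'm putting together a somewhat complex query to do event detection, joining, and time-based binning on a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me detect time-series events elegantly, but whenever I try to use a match_recognize expression inside a Common Table Expression (with .. as ..), I get the following error: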
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
I've done a lot of searching and reading, but I haven't found any documented limitation on match_recognize in CTEs. Here is my query:
with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
),
label_events as (
    select *
    from clean_data
    match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            -- Classify contiguous sections of datapoints with < 20min between adjacent points.
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    )
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
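This produces the same error as above.
Is this a limitation I've overlooked, or am I doing something wrong?
A non-recursive CTE can always be rewritten as an inline view: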
--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
      from dataset) clean_data
match_recognize (
    partition by id
    order by timestamp
    measures
        match_number() as event_number
    all rows per match
    after match skip past last row
    pattern(any_row row_between_gaps+)
    define
        -- Classify contiguous sections of datapoints with < 20min between adjacent points.
        row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
) mr
-- ) -- if other transformations are required
It's not ideal, but at least the query runs.
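If further transformations are needed, the commented-out outer query in the snippet above could be filled in along these lines. This is only a sketch: the per-event summary columns (event_start, event_end, n_points) and the alias labeled are made up for illustration, and it assumes Snowflake accepts match_recognize inside a derived table the same way it accepts the inline view.

```sql
-- Sketch: wrap the match_recognize query in an outer query instead of a CTE.
-- event_start, event_end, n_points and the alias "labeled" are illustrative names.
select id,
       event_number,
       min(timestamp) as event_start,
       max(timestamp) as event_end,
       count(*)       as n_points
from (
    select id, timestamp, measurement, event_number
    from (select distinct id, timestamp, measurement
          from dataset) clean_data
    match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    ) mr
) labeled
group by id, event_number
order by id, event_number;
```

Note that the pattern needs at least two rows per match, so isolated points with no neighbour within 20 minutes will not appear in any event.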
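According to the thread linked in Felipe Hoffa's comment:
This appears to have been an undocumented limitation in Snowflake at the time. A two- or three-step solution has worked well for me: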
with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
)
select *
from clean_data
match_recognize (
    partition by id
    order by timestamp
    measures
        match_number() as event_number
    all rows per match
    after match skip past last row
    pattern(any_row row_between_gaps+)
    define
        -- Classify contiguous sections of datapoints with < 20min between adjacent points.
        row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
);
set quid=last_query_id();
with label_events as (
    select *
    from table(result_scan($quid))
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
I prefer using a variable here because I can re-run the second query multiple times during development and debugging without having to re-run the first one.
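As a rough illustration of that workflow, the binning step might look something like the following once $quid has been set. This is a sketch, not the query I actually ran: it assumes timestamp is a TIMESTAMP column, uses time_slice rather than width_bucket, and the 10-minute bin width and the output names (bin_start, avg_measurement, n_points) are arbitrary choices.

```sql
-- Sketch: time-based binning on the saved match_recognize result.
-- Assumes $quid was set with last_query_id() right after the first query.
-- The 10-minute bin width and the output column names are illustrative.
select id,
       event_number,
       time_slice(timestamp, 10, 'MINUTE') as bin_start,
       avg(measurement)                    as avg_measurement,
       count(*)                            as n_points
from table(result_scan($quid))
group by id, event_number, bin_start
order by id, event_number, bin_start;
```

This statement can be edited and re-run on its own as long as the saved result is still available to result_scan, which as far as I know is limited to about 24 hours after the original query.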
It's also worth noting that GEOGRAPHY objects cached in Snowflake are converted to GeoJSON, so when you retrieve them with result_scan you have to convert them back to the GEOGRAPHY type.
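A minimal sketch of that conversion, assuming the first query had also returned a GEOGRAPHY column; the column name location is hypothetical and does not exist in the example dataset above:

```sql
-- Sketch: restore a GEOGRAPHY column after reading it back through result_scan.
-- "location" is a hypothetical column; the example dataset above has no such column.
select id,
       timestamp,
       event_number,
       to_geography(location) as location
from table(result_scan($quid));
```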