Calculate alarm flood in Snowflake
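I am trying to do an alarm flood calculation in Snowflake. I created the dataset below using a Snowflake window function. If the value is greater than or equal to 3, an alarm flood starts, and it ends at the next 0 value. So in the example below, an alarm flood starts at 9:51 and ends at 9:54, lasting 3 minutes. The next flood starts at 9:57 and ends at 10:02, i.e. 5 minutes. FYI, the value at 9:59 is 3, but because the flood has already started, it does not need to be considered. The next flood starts at 10:03, but there is no 0 value after it, so we have to consider the edge value 10:06.
So the total flood time is 3+5+4 = 12 minutes.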
DateTime Value
3/10/2020 9:50 1
3/10/2020 9:51 3
3/10/2020 9:52 1
3/10/2020 9:53 2
3/10/2020 9:54 0
3/10/2020 9:55 0
3/10/2020 9:56 1
3/10/2020 9:57 3
3/10/2020 9:58 2
3/10/2020 9:59 3
3/10/2020 10:00 2
3/10/2020 10:01 2
3/10/2020 10:02 0
3/10/2020 10:03 3
3/10/2020 10:04 1
3/10/2020 10:05 1
3/10/2020 10:06 1
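So, in short, I want the output below.
I tried the SQL below, but it does not give me the correct output: it fails on the second flood (because the value 3 occurs again before the next 0).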
select t.*,
       (case when value >= 3
             then datediff(minute,
                           datetime,
                           min(case when value = 0 then datetime end) over (order by datetime desc)
                          )
        end) as diff_minutes
from t;
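To see why this attempt fails on the second flood, here is a minimal JavaScript sketch that mimics the query on the sample rows. The helpers `toMinutes` and `naiveDiffs` are hypothetical names introduced only for illustration. The sketch assumes the window expression resolves to the earliest zero row at or after the current row.

```javascript
// Sample rows from the question: [time "H:MM", value].
const rows = [
  ["9:50", 1], ["9:51", 3], ["9:52", 1], ["9:53", 2], ["9:54", 0], ["9:55", 0],
  ["9:56", 1], ["9:57", 3], ["9:58", 2], ["9:59", 3], ["10:00", 2], ["10:01", 2],
  ["10:02", 0], ["10:03", 3], ["10:04", 1], ["10:05", 1], ["10:06", 1],
];

// Convert "H:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// Hypothetical helper mirroring the query: every row with value >= 3 gets
// the difference to the earliest zero row at or after it (null if none).
function naiveDiffs(rows) {
  return rows.map(([time, value], i) => {
    if (value < 3) return null;
    const nextZero = rows.slice(i).find(([, v]) => v === 0);
    return nextZero ? toMinutes(nextZero[0]) - toMinutes(time) : null;
  });
}

const diffs = naiveDiffs(rows);
rows.forEach(([time, value], i) => {
  if (value >= 3) console.log(time, diffs[i]);
});
```

Under these assumptions, both 9:57 and 9:59 get a duration measured to the same zero at 10:02, so the second flood is counted twice, and 10:03 gets no duration because no zero follows it.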
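This is not the code I'm proudest of, but it works and provides a starting point. I'm sure it can be cleaned up or simplified, and I haven't assessed its performance on larger tables.
The key insight I used is that if you add the date_diff to the datetime, you can find the cases where several rows add up to the same value, which means they are all counting toward the same "0" record. Hopefully this concept helps.
Also, the first CTE is a semi-hacky way to get the 4 at the end of the results.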
-- Add a fake zero at the end of the table to provide a value for
-- comparing high values that have not been resolved;
-- added a flag so this fake value can be removed later
with fakezero as
(
    SELECT datetime, value, 1 flag
    FROM test
    UNION ALL
    SELECT dateadd(minute, 1, max(datetime)) datetime, 0 value, 0 flag
    FROM test
)
-- Find date diffs between high values and subsequent low values
,diffs as (
    select t.*,
           (case when value >= 3
                 then datediff(minute,
                               datetime,
                               min(case when value = 0 then datetime end) over (order by datetime desc)
                              )
            end) as diff_minutes
    from fakezero t
)
-- Fix cases where two high values are "resolved" by the same low value,
-- i.e. when adding the date_diff to the datetime results in the same timestamp;
-- this means an earlier high value record still hasn't been "resolved",
-- so the later row is blanked
select
    datetime
    ,value
    ,case when
        lag(dateadd(minute, diff_minutes, datetime)) over(partition by value order by datetime)
        = dateadd(minute, diff_minutes, datetime)
     then null
     else diff_minutes
     end as diff_minutes
from diffs
where flag = 1
order by datetime;
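The idea behind this query can be illustrated outside SQL. The sketch below is a rough JavaScript analogue, not the query itself: `floodDurations` and `toMinutes` are hypothetical names, the fake zero is assumed to sit one step after the last row, and a `Set` of "resolving" timestamps stands in for the `lag()` comparison.

```javascript
// Sample rows from the question: [time "H:MM", value].
const rows = [
  ["9:50", 1], ["9:51", 3], ["9:52", 1], ["9:53", 2], ["9:54", 0], ["9:55", 0],
  ["9:56", 1], ["9:57", 3], ["9:58", 2], ["9:59", 3], ["10:00", 2], ["10:01", 2],
  ["10:02", 0], ["10:03", 3], ["10:04", 1], ["10:05", 1], ["10:06", 1],
];

// Convert "H:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// Hypothetical helper: one duration per flood, keyed by the zero that resolves it.
function floodDurations(rows, stepMinutes = 1) {
  const times = rows.map(([t]) => toMinutes(t));
  // Assumed fake zero: one step after the last row, as in the first CTE.
  const fakeZero = times[times.length - 1] + stepMinutes;
  const seen = new Set(); // timestamps of zeros that already closed a flood
  const result = [];
  rows.forEach(([time, value], i) => {
    if (value < 3) return;
    // Earliest zero at or after this row, else the fake zero.
    let resolvedAt = fakeZero;
    for (let j = i; j < rows.length; j++) {
      if (rows[j][1] === 0) { resolvedAt = times[j]; break; }
    }
    // start + diff lands on the same zero as an earlier high row: same flood.
    if (seen.has(resolvedAt)) return;
    seen.add(resolvedAt);
    result.push({ start: time, minutes: resolvedAt - times[i] });
  });
  return result;
}

const floods = floodDurations(rows);
const total = floods.reduce((sum, f) => sum + f.minutes, 0);
console.log(floods, total);
```

On the sample data this yields floods starting at 9:51, 9:57 and 10:03 with 3, 5 and 4 minutes, 12 minutes in total.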
WITH data as (
    select time::timestamp as time, value from values
        ('2020-03-10 9:50', 1 ),
        ('2020-03-10 9:51', 3 ),
        ('2020-03-10 9:52', 1 ),
        ('2020-03-10 9:53', 2 ),
        ('2020-03-10 9:54', 0 ),
        ('2020-03-10 9:55', 0 ),
        ('2020-03-10 9:56', 1 ),
        ('2020-03-10 9:57', 3 ),
        ('2020-03-10 9:58', 2 ),
        ('2020-03-10 9:59', 3 ),
        ('2020-03-10 10:00', 2 ),
        ('2020-03-10 10:01', 2 ),
        ('2020-03-10 10:02', 0 ),
        ('2020-03-10 10:03', 3 ),
        ('2020-03-10 10:04', 1 ),
        ('2020-03-10 10:05', 1 ),
        ('2020-03-10 10:06', 1 )
        s( time, value)
)
select
    a.time
    ,a.value
    ,min(trig_time)over(partition by reset_time_group order by time) as first_trigger_time
    ,iff(a.time=first_trigger_time, datediff('minute', first_trigger_time, reset_time_group), null) as trig_duration
from (
    select d.time
        ,d.value
        ,iff(d.value>=3,d.time,null) as trig_time
        ,iff(d.value=0,d.time,null) as reset_time
        ,max(time)over(order by time ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as max_time
        ,coalesce(lead(reset_time)ignore nulls over(order by d.time), max_time) as lead_reset_time
        ,coalesce(reset_time,lead_reset_time) as reset_time_group
    from data as d
) as a
order by time;
This gives the results you seem to expect/describe:
TIME                     VALUE  FIRST_TRIGGER_TIME       TRIG_DURATION
2020-03-10 09:50:00.000  1
2020-03-10 09:51:00.000  3      2020-03-10 09:51:00.000  3
2020-03-10 09:52:00.000  1      2020-03-10 09:51:00.000
2020-03-10 09:53:00.000  2      2020-03-10 09:51:00.000
2020-03-10 09:54:00.000  0      2020-03-10 09:51:00.000
2020-03-10 09:55:00.000  0
2020-03-10 09:56:00.000  1
2020-03-10 09:57:00.000  3      2020-03-10 09:57:00.000  5
2020-03-10 09:58:00.000  2      2020-03-10 09:57:00.000
2020-03-10 09:59:00.000  3      2020-03-10 09:57:00.000
2020-03-10 10:00:00.000  2      2020-03-10 09:57:00.000
2020-03-10 10:01:00.000  2      2020-03-10 09:57:00.000
2020-03-10 10:02:00.000  0      2020-03-10 09:57:00.000
2020-03-10 10:03:00.000  3      2020-03-10 10:03:00.000  3
2020-03-10 10:04:00.000  1      2020-03-10 10:03:00.000
2020-03-10 10:05:00.000  1      2020-03-10 10:03:00.000
2020-03-10 10:06:00.000  1      2020-03-10 10:03:00.000
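So how it works: we find the trigger times and the reset times, then work out max_time for the last-row edge case. After that we find the next reset_time looking forward, using max_time if there is none, and then select either the current row's reset time or that lead_reset_time. For what you are doing here this last step could be skipped, because your data cannot trigger and reset on the same row. And given that we do the math on the trigger row, it does not matter whether the reset row knows which group it belongs to.
Then we break into a new select layer, as we have reached the Snowflake limit of nested/interrelated SQL, and do a min across reset_time_group to find the first trigger time, which we then compare to the row time and run the date diff on.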
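The steps above can be sketched in plain JavaScript to make the per-row bookkeeping easier to follow. This is an illustrative analogue under stated assumptions, not a translation of the Snowflake semantics: `annotate` and `toMinutes` are hypothetical names, and times are handled as minutes since midnight.

```javascript
// Sample rows from the question: [time "H:MM", value].
const rows = [
  ["9:50", 1], ["9:51", 3], ["9:52", 1], ["9:53", 2], ["9:54", 0], ["9:55", 0],
  ["9:56", 1], ["9:57", 3], ["9:58", 2], ["9:59", 3], ["10:00", 2], ["10:01", 2],
  ["10:02", 0], ["10:03", 3], ["10:04", 1], ["10:05", 1], ["10:06", 1],
];

// Convert "H:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// Hypothetical helper reproducing first_trigger_time and trig_duration per row.
function annotate(rows) {
  const times = rows.map(([t]) => toMinutes(t));
  const maxTime = times[times.length - 1]; // max_time: last-row edge case

  // reset_time_group: a reset row's own time, otherwise the next reset
  // after the row, falling back to max_time when no reset follows.
  const group = new Array(rows.length);
  let nextReset = maxTime;
  for (let i = rows.length - 1; i >= 0; i--) {
    const isReset = rows[i][1] === 0;
    group[i] = isReset ? times[i] : nextReset;
    if (isReset) nextReset = times[i];
  }

  // Running min of trigger times within each group, in time order.
  const firstIdx = new Map(); // group -> index of its first trigger row
  return rows.map(([time, value], i) => {
    if (value >= 3 && !firstIdx.has(group[i])) firstIdx.set(group[i], i);
    const idx = firstIdx.has(group[i]) ? firstIdx.get(group[i]) : null;
    return {
      time,
      value,
      firstTrigger: idx === null ? null : rows[idx][0],
      // Only the first trigger row of a group carries the duration.
      duration: idx === i ? group[i] - times[i] : null,
    };
  });
}

const annotated = annotate(rows);
const durations = annotated.map((r) => r.duration).filter((d) => d !== null);
console.log(durations);
```

With max_time left at 10:06, the three durations come out as 3, 5 and 3, matching the TRIG_DURATION column shown above.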
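As a side note, datediff is a little naive in its math: '2020-01-01 23:59:59' and '2020-01-02 00:00:01' are 2 seconds apart, yet they count as 1 minute, 1 hour, and 1 day apart, because the function truncates the timestamps to the selected unit and then differences those results.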
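A small JavaScript sketch of that truncate-then-subtract behavior, using the two timestamps from the side note. `truncatedDiff` is a hypothetical helper written for illustration and only models the behavior as described here; both timestamps are assumed to be UTC.

```javascript
// Milliseconds per unit; flooring epoch milliseconds by these sizes
// truncates a UTC timestamp to the start of that unit.
const UNIT_MS = { second: 1000, minute: 60000, hour: 3600000, day: 86400000 };

// Hypothetical helper: truncate both timestamps to the unit, then subtract.
function truncatedDiff(unit, start, end) {
  const size = UNIT_MS[unit];
  return Math.floor(end.getTime() / size) - Math.floor(start.getTime() / size);
}

const a = new Date(Date.UTC(2020, 0, 1, 23, 59, 59)); // 2020-01-01 23:59:59
const b = new Date(Date.UTC(2020, 0, 2, 0, 0, 1));    // 2020-01-02 00:00:01

for (const unit of ["second", "minute", "hour", "day"]) {
  console.log(unit, truncatedDiff(unit, a, b));
}
```

The two timestamps are 2 seconds apart, but because a minute, hour and day boundary all fall between them, each of those units reports a difference of 1.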
To get the final batch with the value 4 that the question asks for, change the lead_reset_time line to:
,coalesce(lead(reset_time)ignore nulls over(order by d.time), dateadd('minute', 1, max_time)) as lead_reset_time
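This moves max_time forward by one minute, which fits if you want to assume that, in the absence of later data, the state of the existing row at 10:06 stays valid for one more minute. That is not how I would do it, but that is the code you asked for.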
JavaScript UDF version:
select d, v, iff(3<=v and 1=row_number() over (partition by N order by d),
count(*) over (partition by N), null) trig_duration
from t, lateral flood_count(t.v::float)
order by d;
where flood_count() is defined as:
create or replace function flood_count(V float)
returns table (N float)
language javascript AS
$${
    initialize: function() {
        this.n = 0
        this.flood = false
    },
    processRow: function(row, rowWriter) {
        if (3<=row.V && !this.flood) {
            this.flood = true
            this.n++
        }
        else if (0==row.V) this.flood=false
        rowWriter.writeRow({ N: this.flood ? this.n : null })
    },
}$$;
Assuming this input:
create or replace table t as
select to_timestamp(d, 'mm/dd/yyyy hh:mi') d, v
from values
('3/10/2020 9:50', 1),
('3/10/2020 9:51', 3),
('3/10/2020 9:52', 1),
('3/10/2020 9:53', 2),
('3/10/2020 9:54', 0),
('3/10/2020 9:55', 0),
('3/10/2020 9:56', 1),
('3/10/2020 9:57', 3),
('3/10/2020 9:58', 2),
('3/10/2020 9:59', 3),
('3/10/2020 10:00', 2),
('3/10/2020 10:01', 2),
('3/10/2020 10:02', 0),
('3/10/2020 10:03', 3),
('3/10/2020 10:04', 1),
('3/10/2020 10:05', 1),
('3/10/2020 10:06', 1)
t(d,v)
;
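The state machine in flood_count() can also be tried outside Snowflake. The sketch below copies the initialize/processRow logic into a plain object and feeds it the sample values in time order. The driver loop and the `writeRow` stub are stand-ins assumed for illustration, not the Snowflake runtime.

```javascript
// Same state machine as the UDF body, as a plain object.
const floodCount = {
  initialize() {
    this.n = 0;         // running flood number
    this.flood = false; // true while a flood is open
  },
  processRow(row, rowWriter) {
    if (3 <= row.V && !this.flood) {
      this.flood = true; // a value >= 3 opens a new flood
      this.n++;
    } else if (0 == row.V) {
      this.flood = false; // a zero closes the current flood
    }
    rowWriter.writeRow({ N: this.flood ? this.n : null });
  },
};

// Sample values from the question, in time order (one row per minute).
const values = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];

// Stand-in driver: collect the N written for each input row.
const out = [];
floodCount.initialize();
for (const V of values) {
  floodCount.processRow({ V }, { writeRow: (r) => out.push(r.N) });
}

// Equivalent of count(*) over (partition by N) for the non-null groups.
const counts = {};
for (const n of out) {
  if (n !== null) counts[n] = (counts[n] || 0) + 1;
}
console.log(out, counts);
```

Here the three floods cover 3, 5 and 4 rows. Note that this approach counts rows rather than subtracting timestamps, so it equals minutes only when there is exactly one row per minute, as in the sample data.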