Find the consecutive values in Impala

I have the data set below with ID, date, and value. I want to flag every ID that has a value of 0 on three consecutive days.

id date value
1 8/10/2021 1
1 8/11/2021 0
1 8/12/2021 0
1 8/13/2021 0
1 8/14/2021 5
2 8/10/2021 2
2 8/11/2021 3
2 8/12/2021 0
2 8/13/2021 0
2 8/14/2021 6
3 8/10/2021 3
3 8/11/2021 4
3 8/12/2021 0
3 8/13/2021 0
3 8/14/2021 0

Output

id date value Flag
1 8/10/2021 1 Y
1 8/11/2021 0 Y
1 8/12/2021 0 Y
1 8/13/2021 0 Y
1 8/14/2021 5 Y
2 8/10/2021 2 N
2 8/11/2021 3 N
2 8/12/2021 0 N
2 8/13/2021 0 N
2 8/14/2021 6 N
3 8/10/2021 3 Y
3 8/11/2021 4 Y
3 8/12/2021 0 Y
3 8/13/2021 0 Y
3 8/14/2021 0 Y

Thanks.

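You can identify the IDs by comparing lag() values, and then spread the flag across all rows of each ID. The following puts the flag on the third 0: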

select t.*,
       (case when value = 0 and prev_value_date_2 = prev_date_2
             then 'Y' else 'N'
        end) as flag_on_row
from (select t.*,
             lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
             lag(date, 2) over (partition by id order by date) as prev_date_2
      from t
     ) t;

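The logic above uses lag(), so it extends easily to longer runs of 0s. The "2" looks two rows back: prev_date_2 is the date two rows back within the id, and prev_value_date_2 is the date two rows back among rows of the same id with the same value. If the two dates are equal, the last three rows of the id all have the same value. For a run of N zeros, use lag(date, N - 1) in both places.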
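To check the reasoning outside the database, here is a minimal Python sketch that emulates the two lag() calls on the sample data from the question. The names ROWS, flag_on_row, and run_length are made up for this illustration, and the dates are rewritten as yyyy-MM-dd so they sort correctly as strings. It is not Impala code, only a model of the comparison.

```python
from collections import defaultdict

# Sample rows from the question as (id, date, value).
ROWS = [
    (1, "2021-08-10", 1), (1, "2021-08-11", 0), (1, "2021-08-12", 0),
    (1, "2021-08-13", 0), (1, "2021-08-14", 5),
    (2, "2021-08-10", 2), (2, "2021-08-11", 3), (2, "2021-08-12", 0),
    (2, "2021-08-13", 0), (2, "2021-08-14", 6),
    (3, "2021-08-10", 3), (3, "2021-08-11", 4), (3, "2021-08-12", 0),
    (3, "2021-08-13", 0), (3, "2021-08-14", 0),
]


def flag_on_row(rows, run_length=3):
    """Mark the row that completes a run of `run_length` zeros (hypothetical helper)."""
    offset = run_length - 1                 # lag(date, 2) for a run of three
    dates_by_id = defaultdict(list)         # partition by id
    dates_by_id_value = defaultdict(list)   # partition by value, id
    result = []
    for id_, date, value in sorted(rows, key=lambda r: (r[0], r[1])):
        all_dates = dates_by_id[id_]
        same_value_dates = dates_by_id_value[(id_, value)]
        all_dates.append(date)
        same_value_dates.append(date)
        # Emulate lag(): None when there are not enough earlier rows.
        prev_date = all_dates[-1 - offset] if len(all_dates) > offset else None
        prev_value_date = (
            same_value_dates[-1 - offset] if len(same_value_dates) > offset else None
        )
        # In SQL a comparison with NULL is never true, so None never matches.
        hit = value == 0 and prev_date is not None and prev_value_date == prev_date
        result.append((id_, date, value, "Y" if hit else "N"))
    return result


for row in flag_on_row(ROWS):
    print(row)
```

Only the row that completes a run is marked here, which matches flag_on_row in the query above. Spreading the flag to the rest of the id is a separate step.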

To spread the value across all rows of the id:

select t.*, max(flag_on_row) over (partition by id) as flag
from (select t.*,
             (case when value = 0 and prev_value_date_2 = prev_date_2
                   then 'Y' else 'N'
              end) as flag_on_row
      from (select t.*,
                   lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
                   lag(date, 2) over (partition by id order by date) as prev_date_2
            from t
           ) t
     ) t;

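With the window count() function, you can count the 0s in the frame [current row, 2 following], ordered by date. This evaluates a frame of three consecutive rows for each row: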

count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following) cnt

If the count equals 3, three consecutive 0s have been found. A case expression produces 'Y' for each row with cnt=3: case when cnt=3 then 'Y' else 'N' end.

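To propagate the 'Y' flag to the whole id group, use max(...) over (partition by id). This works because 'Y' sorts after 'N', so the maximum is 'Y' whenever at least one row in the group has it.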
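The sliding-frame idea can also be sketched in plain Python to see how the counts behave. This is only an illustration under the assumption of one row per id and date. The names ROWS, flag_ids, and run_length are hypothetical and do not come from Hive or Impala.

```python
from collections import defaultdict

# Sample rows from the question as (id, date, value), dates in yyyy-MM-dd.
ROWS = [
    (1, "2021-08-10", 1), (1, "2021-08-11", 0), (1, "2021-08-12", 0),
    (1, "2021-08-13", 0), (1, "2021-08-14", 5),
    (2, "2021-08-10", 2), (2, "2021-08-11", 3), (2, "2021-08-12", 0),
    (2, "2021-08-13", 0), (2, "2021-08-14", 6),
    (3, "2021-08-10", 3), (3, "2021-08-11", 4), (3, "2021-08-12", 0),
    (3, "2021-08-13", 0), (3, "2021-08-14", 0),
]


def flag_ids(rows, run_length=3):
    """Return {id: 'Y' or 'N'} using a count over a forward-looking frame."""
    values_by_id = defaultdict(list)        # partition by id, order by date
    for id_, _date, value in sorted(rows, key=lambda r: (r[0], r[1])):
        values_by_id[id_].append(value)

    flags = {}
    for id_, values in values_by_id.items():
        # cnt for row i = zeros in rows i .. i + run_length - 1
        # (rows between current row and 2 following when run_length is 3).
        counts = [
            sum(1 for v in values[i:i + run_length] if v == 0)
            for i in range(len(values))
        ]
        row_flags = ["Y" if c == run_length else "N" for c in counts]
        # max() over the group: 'Y' > 'N', so one 'Y' flags the whole id.
        flags[id_] = max(row_flags)
    return flags


print(flag_ids(ROWS))
```

Frames near the end of a partition hold fewer than three rows, so their count can never reach 3. That is why no extra boundary check is needed.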

Demo using your data example (tested on Hive):

with mydata as (--Data example, dates converted to sortable format yyyy-MM-dd
select 1 id,'2021-08-10' date_, 1 value union all
select 1,'2021-08-11',0 union all
select 1,'2021-08-12',0 union all
select 1,'2021-08-13',0 union all
select 1,'2021-08-14',5 union all
select 2,'2021-08-10',2 union all
select 2,'2021-08-11',3 union all
select 2,'2021-08-12',0 union all
select 2,'2021-08-13',0 union all
select 2,'2021-08-14',6 union all
select 3,'2021-08-10',3 union all
select 3,'2021-08-11',4 union all
select 3,'2021-08-12',0 union all
select 3,'2021-08-13',0 union all
select 3,'2021-08-14',0
) --End of data example, use your table instead of this CTE

select id, date_, value, 
       max(case when cnt=3 then 'Y' else 'N' end) over (partition by id) flag
from
(
select id, date_, value, 
 count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following ) cnt
from mydata
)s
  order by id, date_  --remove ordering if not necessary
                      --added it to get result in the same order

Result:

id  date_       value   flag    
1   2021-08-10  1       Y
1   2021-08-11  0       Y
1   2021-08-12  0       Y
1   2021-08-13  0       Y
1   2021-08-14  5       Y
2   2021-08-10  2       N
2   2021-08-11  3       N
2   2021-08-12  0       N
2   2021-08-13  0       N
2   2021-08-14  6       N
3   2021-08-10  3       Y
3   2021-08-11  4       Y
3   2021-08-12  0       Y
3   2021-08-13  0       Y
3   2021-08-14  0       Y