Find consecutive values in Impala
Below is a dataset containing an id, a date, and a value. I want to flag every id that has a value of 0 on three consecutive days.
id | date | value |
---|---|---|
1 | 8/10/2021 | 1 |
1 | 8/11/2021 | 0 |
1 | 8/12/2021 | 0 |
1 | 8/13/2021 | 0 |
1 | 8/14/2021 | 5 |
2 | 8/10/2021 | 2 |
2 | 8/11/2021 | 3 |
2 | 8/12/2021 | 0 |
2 | 8/13/2021 | 0 |
2 | 8/14/2021 | 6 |
3 | 8/10/2021 | 3 |
3 | 8/11/2021 | 4 |
3 | 8/12/2021 | 0 |
3 | 8/13/2021 | 0 |
3 | 8/14/2021 | 0 |
Desired output:
id | date | value | Flag |
---|---|---|---|
1 | 8/10/2021 | 1 | Y |
1 | 8/11/2021 | 0 | Y |
1 | 8/12/2021 | 0 | Y |
1 | 8/13/2021 | 0 | Y |
1 | 8/14/2021 | 5 | Y |
2 | 8/10/2021 | 2 | N |
2 | 8/11/2021 | 3 | N |
2 | 8/12/2021 | 0 | N |
2 | 8/13/2021 | 0 | N |
2 | 8/14/2021 | 6 | N |
3 | 8/10/2021 | 3 | Y |
3 | 8/11/2021 | 4 | Y |
3 | 8/12/2021 | 0 | Y |
3 | 8/13/2021 | 0 | Y |
3 | 8/14/2021 | 0 | Y |
Thanks.
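You can identify the ids by comparing two `lag()` values, then spread the result to all rows of the id. The following query sets the flag on the third `0` of a run:

    select t.*,
           (case when value = 0 and prev_value_date_2 = prev_date_2
                 then 'Y' else 'N'
            end) as flag_on_row
    from (select t.*,
                 -- date two rows back among rows with the same id and the same value
                 lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
                 -- date two rows back among all rows of the id
                 lag(date, 2) over (partition by id order by date) as prev_date_2
          from t
         ) t;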
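The logic above uses `lag()`, so it extends easily to longer runs of `0`s. The `2` looks two rows back: if both lagged dates are equal, the row two back within the same-value group is also the row two back overall, so three consecutive rows have the same value.

To propagate the flag to every row of the id:

    -- 'Y' sorts after 'N', so max() returns 'Y' if any row of the id is flagged
    select t.*, max(flag_on_row) over (partition by id) as flag
    from (select t.*,
                 (case when value = 0 and prev_value_date_2 = prev_date_2
                       then 'Y' else 'N'
                  end) as flag_on_row
          from (select t.*,
                       lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
                       lag(date, 2) over (partition by id order by date) as prev_date_2
                from t
               ) t
         ) t;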
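The claim that the `lag()` comparison extends to longer runs can be checked on the sample data. The sketch below is a stand-in rather than Impala itself: it assumes the same query shape runs in SQLite through Python's `sqlite3` module (window functions need SQLite 3.25 or later). The table `t`, the column name `date_`, and the helper `flag_ids` are hypothetical names chosen for illustration, and the lag offset is taken to be `run_length - 1`.

```python
import sqlite3

# Sample rows from the question as (id, date_, value); dates are stored as
# sortable yyyy-MM-dd strings (an assumption made for this sketch).
ROWS = [
    (1, "2021-08-10", 1), (1, "2021-08-11", 0), (1, "2021-08-12", 0),
    (1, "2021-08-13", 0), (1, "2021-08-14", 5),
    (2, "2021-08-10", 2), (2, "2021-08-11", 3), (2, "2021-08-12", 0),
    (2, "2021-08-13", 0), (2, "2021-08-14", 6),
    (3, "2021-08-10", 3), (3, "2021-08-11", 4), (3, "2021-08-12", 0),
    (3, "2021-08-13", 0), (3, "2021-08-14", 0),
]


def flag_ids(run_length):
    """Return {id: 'Y' or 'N'}; 'Y' means the id has run_length consecutive 0 rows."""
    if not isinstance(run_length, int) or run_length < 1:
        raise ValueError("run_length must be a positive integer")
    offset = run_length - 1  # 3 consecutive rows -> look 2 rows back

    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer, date_ text, value integer)")
    conn.executemany("insert into t values (?, ?, ?)", ROWS)

    # Same shape as the query above; only the lag offset is parameterized.
    # offset is a validated int, so interpolating it into the SQL is safe.
    query = f"""
        select id, max(flag_on_row) over (partition by id) as flag
        from (select id,
                     case when value = 0 and prev_value_date = prev_date
                          then 'Y' else 'N' end as flag_on_row
              from (select id, value,
                           lag(date_, {offset}) over
                               (partition by value, id order by date_) as prev_value_date,
                           lag(date_, {offset}) over
                               (partition by id order by date_) as prev_date
                    from t) x) y
    """
    flags = dict(conn.execute(query).fetchall())  # one entry per id
    conn.close()
    return flags


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, flag_ids(n))
```

With the sample data, `flag_ids(3)` flags ids 1 and 3, while `flag_ids(2)` also flags id 2, which has two consecutive 0s.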
With the window `count()` function you can count the 0s in the frame [current row, 2 following], ordered by date. This evaluates a frame of three consecutive rows for each row:

    count(case when value=0 then 1 else null end) over (partition by id order by date_ rows between current row and 2 following) cnt

If the count equals 3, three consecutive 0s were found, and a case expression yields Y for each row where cnt=3: `case when cnt=3 then 'Y' else 'N' end`. Near the end of a partition the frame holds fewer than three rows, so the count cannot reach 3 there.

To propagate the 'Y' flag to the whole id group, use `max(...) over (partition by id)`.

Demo using your data example (tested on Hive):

    with mydata as ( -- data example, dates converted to sortable format yyyy-MM-dd
        select 1 id, '2021-08-10' date_, 1 value union all
        select 1, '2021-08-11', 0 union all
        select 1, '2021-08-12', 0 union all
        select 1, '2021-08-13', 0 union all
        select 1, '2021-08-14', 5 union all
        select 2, '2021-08-10', 2 union all
        select 2, '2021-08-11', 3 union all
        select 2, '2021-08-12', 0 union all
        select 2, '2021-08-13', 0 union all
        select 2, '2021-08-14', 6 union all
        select 3, '2021-08-10', 3 union all
        select 3, '2021-08-11', 4 union all
        select 3, '2021-08-12', 0 union all
        select 3, '2021-08-13', 0 union all
        select 3, '2021-08-14', 0
    ) -- end of data example, use your table instead of this CTE
    select id, date_, value,
           -- 'Y' sorts after 'N', so max() yields 'Y' for the whole id if any frame holds three 0s
           max(case when cnt=3 then 'Y' else 'N' end) over (partition by id) flag
    from
    (
        select id, date_, value,
               count(case when value=0 then 1 else null end) over (partition by id order by date_ rows between current row and 2 following) cnt
        from mydata
    ) s
    order by id, date_ -- remove ordering if not necessary; added to get the result in the same order
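Result:

id | date_ | value | flag |
---|---|---|---|
1 | 2021-08-10 | 1 | Y |
1 | 2021-08-11 | 0 | Y |
1 | 2021-08-12 | 0 | Y |
1 | 2021-08-13 | 0 | Y |
1 | 2021-08-14 | 5 | Y |
2 | 2021-08-10 | 2 | N |
2 | 2021-08-11 | 3 | N |
2 | 2021-08-12 | 0 | N |
2 | 2021-08-13 | 0 | N |
2 | 2021-08-14 | 6 | N |
3 | 2021-08-10 | 3 | Y |
3 | 2021-08-11 | 4 | Y |
3 | 2021-08-12 | 0 | Y |
3 | 2021-08-13 | 0 | Y |
3 | 2021-08-14 | 0 | Y |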
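Both queries count consecutive rows, which matches consecutive days only when every id has exactly one row per day. If a day can be missing, one possible variation is to require that the row two back is also exactly two calendar days earlier. The sketch below illustrates this under stated assumptions: it runs in SQLite via Python's `sqlite3` (3.25 or later), uses SQLite's `julianday()` for the date difference, assumes dates are unique per id and stored as yyyy-MM-dd strings, and uses the hypothetical names `t`, `date_`, and `flag_consecutive_days`. In Impala a date-difference function such as `datediff()` would presumably play the role of `julianday()`; that mapping is untested here.

```python
import sqlite3


def flag_consecutive_days(rows):
    """rows: iterable of (id, 'yyyy-MM-dd', value). Returns {id: 'Y' or 'N'}.

    An id gets 'Y' only if it has value 0 on three consecutive calendar days.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer, date_ text, value integer)")
    conn.executemany("insert into t values (?, ?, ?)", rows)

    query = """
        select id, max(flag_on_row) over (partition by id) as flag
        from (select id,
                     case when value = 0
                           and lag(value, 1) over w = 0
                           and lag(value, 2) over w = 0
                           -- the row two back must be exactly two days earlier;
                           -- with unique dates this forces the middle row to be
                           -- exactly one day earlier
                           and julianday(date_) - julianday(lag(date_, 2) over w) = 2
                          then 'Y' else 'N' end as flag_on_row
              from t
              window w as (partition by id order by date_)) s
    """
    flags = dict(conn.execute(query).fetchall())  # one entry per id
    conn.close()
    return flags


if __name__ == "__main__":
    sample = [
        # id 1: three 0 rows in a row, but 2021-08-13 is missing
        (1, "2021-08-11", 0), (1, "2021-08-12", 0), (1, "2021-08-14", 0),
        # id 2: 0 on three consecutive calendar days
        (2, "2021-08-11", 0), (2, "2021-08-12", 0), (2, "2021-08-13", 0),
    ]
    print(flag_consecutive_days(sample))
```

Here id 1 has three 0 rows in a row but a gap on 2021-08-13, so only id 2 is flagged.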