SQL to get 3 adjacent actions without duplicates from the flags
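I have a problem that is somewhat similar to another question, but more complicated.
Here is my dummy data.
For each user, I would like to get the 3 adjacent actions (without duplicates) ending at each flagged row.
Here is a diagram describing my idea.
This is what I want:
How can I implement this in SQL (I am using Google BigQuery)?
I know the LAG function may be a solution, but I don't know how to skip the repeated actions.
I hope someone can enlighten me. Thanks a million!
Here is the code to generate the dataset.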
WITH
src_table AS (
SELECT 'Jack' AS User, 1 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 2 AS Sequence, 'Work' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 3 AS Sequence, 'Sleep' AS Action, 'Flag A' AS Flag UNION ALL
SELECT 'Jack' AS User, 4 AS Sequence, 'Exercise' AS Action, 'Flag B' AS Flag UNION ALL
SELECT 'Kenny' AS User, 1 AS Sequence, 'Run' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 2 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 3 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 4 AS Sequence, 'Work' AS Action, 'Flag C' AS Flag UNION ALL
SELECT 'Kenny' AS User, 5 AS Sequence, 'Work' AS Action, 'Flag D' AS Flag UNION ALL
SELECT 'May' AS User, 1 AS Sequence, 'Work' AS Action, 'Flag A' AS Flag
)
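You can use RANK() to rank the duplicates, then filter on RANK() = 1 to keep the first (or last) row of each set of duplicates. The problem then reduces to the other question you mentioned.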
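One possible reading of this suggestion, as a minimal sketch rather than the answerer's exact query: number each run of identical adjacent actions, rank the rows inside each run, and keep rank 1. The names `numbered`, `ranked`, `run_id` and `rnk` are made up for illustration, and only Kenny's rows from the question are used to keep it short.

```sql
WITH src_table AS (
  SELECT 'Kenny' AS User, 1 AS Sequence, 'Run' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 2 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 3 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 4 AS Sequence, 'Work' AS Action, 'Flag C' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 5 AS Sequence, 'Work' AS Action, 'Flag D' AS Flag
),
numbered AS (
  -- The difference of the two row numbers is constant within a run of
  -- identical adjacent actions (hypothetical column name: run_id).
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY User ORDER BY Sequence)
           - ROW_NUMBER() OVER (PARTITION BY User, Action ORDER BY Sequence) AS run_id
  FROM src_table
),
ranked AS (
  -- Rank the rows inside each run; rank 1 is the first row of the run.
  SELECT *,
         RANK() OVER (PARTITION BY User, Action, run_id ORDER BY Sequence) AS rnk
  FROM numbered
)
SELECT User, Sequence, Action, Flag
FROM ranked
WHERE rnk = 1
ORDER BY User, Sequence;
```

Keeping only the first row of a run drops Kenny's second Work row, and with it Flag D. If every flagged row must survive, this simple filter is not enough on its own.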
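This is similar to your previous query. If I assume that adjacent rows with the same action have at most one flag, then we can use a gaps-and-islands approach, and then lag.
The first step is: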
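-- t is your table (src_table in the question); assumes sequence has no gaps per user
select user, min(sequence) as first_sequence, action, max(flag) as flag
from (select t.*,
             -- numbering per (user, action) makes sequence - seqnum constant within a run
             row_number() over (partition by user, action order by sequence) as seqnum
      from t
     ) t
group by user, action, sequence - seqnum;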
Then, using this as the "base" data, we can use lag:
with cte as (
      -- one row per run of identical adjacent actions
      select user, min(sequence) as first_sequence, action, max(flag) as flag
      from (select t.*,
                   row_number() over (partition by user, action order by sequence) as seqnum
            from t
           ) t
      group by user, action, sequence - seqnum
     )
select user, prev_action2, prev_action, action, flag
from (select cte.*,
             lag(action) over (partition by user order by first_sequence) as prev_action,
             lag(action, 2) over (partition by user order by first_sequence) as prev_action2
      from cte
     ) c
-- keep only flagged runs that have two distinct preceding actions
where prev_action2 is not null and flag <> '';
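If a user's adjacent rows with the same activity can have different flags, I would appreciate it if you asked a new question. In that question, it would help to include a SELECT statement that generates the sample data you are using.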
Consider the approach below:
select user, actions.action_sequence, flag from (
select *, (
select as struct count(1) actions_count,
string_agg(action, ' >> ' order by grp) action_sequence
from (
select action, grp from t.arr group by action, grp
)) actions
from (
select *, array_agg(struct(action, grp))
over(partition by user order by grp desc range between current row and 2 following) arr
from (
select *, countif(change) over(partition by user order by sequence) grp
from (
select *, action != lag(action) over(partition by user order by sequence) change
from src_table
)
)
) t
)
where flag != ''
and actions.actions_count = 3
# order by user, sequence
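If applied to the sample data in your question, the output is
Note: the solution above works for any number of adjacent actions (without duplicates). You only need to change it in the two corresponding places (the 2 and the 3):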
over(partition by user order by grp desc range between current row and 2 following) arr
and
and actions.actions_count = 3
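The two innermost subqueries of that query are what number the runs of identical actions. Here is a minimal sketch of just that step, using only Kenny's rows from the question. The explicit column list and the ORDER BY are additions for readability, not part of the original answer.

```sql
WITH src_table AS (
  SELECT 'Kenny' AS User, 1 AS Sequence, 'Run' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 2 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 3 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 4 AS Sequence, 'Work' AS Action, 'Flag C' AS Flag UNION ALL
  SELECT 'Kenny' AS User, 5 AS Sequence, 'Work' AS Action, 'Flag D' AS Flag
)
SELECT User, Sequence, Action, Flag,
       -- Running count of action changes; COUNTIF ignores the NULL on the first row.
       COUNTIF(change) OVER (PARTITION BY User ORDER BY Sequence) AS grp
FROM (
  -- change is TRUE when the action differs from the previous row's action
  SELECT *, Action != LAG(Action) OVER (PARTITION BY User ORDER BY Sequence) AS change
  FROM src_table
)
ORDER BY User, Sequence;
-- grp for Kenny's five rows: 0, 1, 1, 2, 2
```

Because the outer window is ordered by grp desc with range between current row and 2 following, a row with grp = 2 collects every row whose grp is 2, 1, or 0. The group by action, grp in the scalar subquery then collapses each run to a single entry, which is why both of Kenny's flagged Work rows see the same three actions.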