SQL to get 3 adjacent actions without duplicates from the flags

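I have a problem that is somewhat similar to an earlier question, but more complex.

Here is my dummy data.

For each user, I want to get the 3 adjacent actions (without duplicates) leading up to each flag.

Here is a diagram describing my idea.

This is what I want:

How can I implement this in SQL (I am using Google BigQuery)? I know the LAG function might be a solution, but I don't know how to avoid the duplicate actions.

Hope someone can enlighten me. Thanks a million!

Here is the code to generate the dataset.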

WITH
src_table AS (
SELECT 'Jack' AS User, 1 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 2 AS Sequence, 'Work' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 3 AS Sequence, 'Sleep' AS Action, 'Flag A' AS Flag UNION ALL
SELECT 'Jack' AS User, 4 AS Sequence, 'Exercise' AS Action, 'Flag B' AS Flag UNION ALL
SELECT 'Kenny' AS User, 1 AS Sequence, 'Run' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 2 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 3 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 4 AS Sequence, 'Work' AS Action, 'Flag C' AS Flag UNION ALL
SELECT 'Kenny' AS User, 5 AS Sequence, 'Work' AS Action, 'Flag D' AS Flag UNION ALL
SELECT 'May' AS User, 1 AS Sequence, 'Work' AS Action, 'Flag A' AS Flag
)

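You can use RANK() to rank the rows within each run of duplicates, then filter on RANK() = 1 to keep the first (or last) row of each run. The problem then reduces to the other question you mentioned.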
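One way to read this suggestion is sketched below. It is my interpretation rather than the answerer's exact query: the names changed, grp, and rnk are illustrative, and only Kenny's rows from the question are inlined so the query runs on its own in BigQuery.

```sql
-- Sketch: keep the first row of each run of identical adjacent actions.
with src_table as (
  select * from unnest([
    struct('Kenny' as user, 1 as sequence, 'Run' as action, '' as flag),
    ('Kenny', 2, 'Eat', ''),
    ('Kenny', 3, 'Eat', ''),
    ('Kenny', 4, 'Work', 'Flag C'),
    ('Kenny', 5, 'Work', 'Flag D')
  ])
)
select user, sequence, action, flag
from (
  select *,
    -- rnk = 1 marks the first row of each run
    rank() over (partition by user, grp order by sequence) as rnk
  from (
    select *,
      -- grp goes up by 1 every time the action differs from the previous row
      countif(changed) over (partition by user order by sequence) as grp
    from (
      select *,
        action != lag(action) over (partition by user order by sequence) as changed
      from src_table
    )
  )
)
where rnk = 1
order by user, sequence;
```

Note that keeping only the first row of a run discards any flag that sits on a later duplicate. Here the second Work row, carrying Flag D, is dropped, so this sketch only fits if losing such flags is acceptable.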

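This is similar to your previous query. If I assume that adjacent rows with the same action have at most one flag, then we can use a gaps-and-islands approach, and then lag.

The first step is: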

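select user, min(sequence) as first_sequence, action, max(flag) as flag
from (select t.*,
             -- numbering within each (user, action) keeps sequence - seqnum
             -- constant across a run of identical adjacent actions
             row_number() over (partition by user, action order by sequence) as seqnum
      from t
     ) t
group by user, action, sequence - seqnum;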

Then, using this as the "base" data, we can use lag:

with cte as (
      -- one row per run of identical adjacent actions
      select user, min(sequence) as first_sequence, action, max(flag) as flag
      from (select t.*,
                   row_number() over (partition by user, action order by sequence) as seqnum
            from t
           ) t
      group by user, action, sequence - seqnum
     )
select user, prev_action, prev_action2, action, flag
from (select cte.*,
             lag(action) over (partition by user order by first_sequence) as prev_action,
             lag(action, 2) over (partition by user order by first_sequence) as prev_action2
      from cte
     ) c
where prev_action is not null;
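To try the two steps end to end, here is a self-contained sketch for BigQuery. Several things in it are my assumptions, not part of the answer: the CTE name islands, the alias first_sequence, the concatenated action_sequence column, and the final filter that keeps only flagged rows with two earlier distinct actions. Kenny's second Work row (Flag D) is left out so that the "at most one flag per run" assumption holds, and the sequence - seqnum trick relies on sequence having no gaps within a user.

```sql
with src_table as (
  -- sample rows from the question, minus Kenny's Flag D row (see note above)
  select * from unnest([
    struct('Jack' as user, 1 as sequence, 'Eat' as action, '' as flag),
    ('Jack', 2, 'Work', ''),
    ('Jack', 3, 'Sleep', 'Flag A'),
    ('Jack', 4, 'Exercise', 'Flag B'),
    ('Kenny', 1, 'Run', ''),
    ('Kenny', 2, 'Eat', ''),
    ('Kenny', 3, 'Eat', ''),
    ('Kenny', 4, 'Work', 'Flag C'),
    ('May', 1, 'Work', 'Flag A')
  ])
),
islands as (
  -- step 1: collapse each run of identical adjacent actions into one row
  select user, action, min(sequence) as first_sequence, max(flag) as flag
  from (
    select s.*,
      row_number() over (partition by user, action order by sequence) as seqnum
    from src_table s
  ) s
  group by user, action, sequence - seqnum
)
-- step 2: look back one and two runs, then keep flagged rows with a full triple
select user,
  concat(prev_action2, ' >> ', prev_action, ' >> ', action) as action_sequence,
  flag
from (
  select i.*,
    lag(action) over (partition by user order by first_sequence) as prev_action,
    lag(action, 2) over (partition by user order by first_sequence) as prev_action2
  from islands i
) i
where flag != '' and prev_action2 is not null
order by user, first_sequence;
```

On this reduced data set the query should return Eat >> Work >> Sleep (Flag A) and Work >> Sleep >> Exercise (Flag B) for Jack, and Run >> Eat >> Work (Flag C) for Kenny. May is excluded because she has only one action.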

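If rows with the same activity for a user can have different flags, I would appreciate it if you asked a new question. In the new question, it would be helpful to include a SELECT statement that generates the sample data you are using.

Consider the following: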

select user, actions.action_sequence, flag  from (
  select *, (
    select as struct count(1) actions_count,
      string_agg(action, ' >> ' order by grp) action_sequence
    from (
      select action, grp from t.arr group by action, grp
    )) actions
  from (
    select *, array_agg(struct(action, grp)) 
      over(partition by user order by grp desc range between current row and 2 following) arr
    from (
      select *, countif(change) over(partition by user order by sequence) grp
      from (
        select *, action != lag(action) over(partition by user order by sequence) change
        from src_table
      )
    )
  ) t
)
where flag != '' 
and actions.actions_count = 3
# order by user, sequence

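When applied to the sample data in your question, the output is

Note: the above solution works for any number of adjacent actions (without duplicates). You only need to change the two corresponding places (the 2 and the 3):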

over(partition by user order by grp desc range between current row and 2 following) arr    

and actions.actions_count = 3