Selecting Distinct Consecutive Values of Timeseries

I have a table in Snowflake/dbt from which I want to select the distinct consecutive entries. For example, if I have

user_id  session_id  action  timestamp
2        3           scroll  21-08-01 12:00:01
2        3           scroll  21-08-01 12:00:02
2        3           scroll  21-08-01 12:00:03
2        3           click   21-08-01 12:00:04
2        3           click   21-08-01 12:00:06
2        3           scroll  21-08-01 12:00:10
2        3           saved   21-08-01 12:00:10

I would like to end up with this:

user_id  session_id  action  timestamp
2        3           scroll  21-08-01 12:00:03
2        3           click   21-08-01 12:00:06
2        3           scroll  21-08-01 12:00:10
2        3           saved   21-08-01 12:00:10

I tried using row_number() followed by qualify, but that numbers all rows with the same action as one sequence, even when they are not consecutive.

You may try the following, which groups consecutive occurrences of the same action and, for each group, keeps the most recent row.

SELECT
    user_id,
    session_id,
    action,
    timestamp
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
             PARTITION BY user_id,session_id,action,gn
             ORDER BY timestamp DESC
        ) as rn
    FROM (
        SELECT
            *,
            SUM(continued) OVER (ORDER BY timestamp) as gn
        FROM (
            SELECT
                *,
                CASE 
                    WHEN
                        LAG(
                            CONCAT(user_id,session_id,action),
                            1,
                            CONCAT(user_id,session_id,action)
                        ) OVER (
                            ORDER BY timestamp
                        ) = CONCAT(user_id,session_id,action) THEN 0
                    ELSE 1
                END as continued
            FROM
                my_table
        ) t2
    ) t1
) t
WHERE rn=1

Let me know if this works for you.
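
For reference, the same three steps (flag a change, take a running total, keep the last row per group) can also be sketched with CTEs. This is only a sketch under a few assumptions that are not in the query above: runs are tracked separately per user_id and session_id, ties on timestamp are broken by action, and the names is_new_run, run_no and last_timestamp are made up for illustration.

```sql
-- Sketch only: my_table and its four columns come from the question;
-- is_new_run, run_no and last_timestamp are hypothetical names.
WITH flagged AS (
    SELECT
        user_id, session_id, action, timestamp,
        -- 1 when this row starts a new run of the same action, else 0.
        -- LAG returns NULL on the first row of a session, so the
        -- comparison is not true and the row is flagged as a new run.
        CASE
            WHEN action = LAG(action) OVER (
                     PARTITION BY user_id, session_id
                     ORDER BY timestamp, action
                 ) THEN 0
            ELSE 1
        END AS is_new_run
    FROM my_table
),
numbered AS (
    SELECT
        *,
        -- running count of run starts = run number within the session;
        -- the explicit ROWS frame keeps rows with equal timestamps apart
        SUM(is_new_run) OVER (
            PARTITION BY user_id, session_id
            ORDER BY timestamp, action
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS run_no
    FROM flagged
)
SELECT user_id, session_id, action, MAX(timestamp) AS last_timestamp
FROM numbered
GROUP BY user_id, session_id, action, run_no
ORDER BY user_id, session_id, run_no;
```

On the sample data this should give the same four rows. Because saved and scroll share the timestamp 12:00:10, their relative order depends on the tie-breaker chosen; with action as the tie-breaker, saved sorts before scroll.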

I tried a slightly different approach from ggordon's, building an inline view that carries the contents of the "next" record (using the LEAD function).

select user_id, session_id, action, ts
from (
  select abc.*, 
         lead(user_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_user_id, 
         lead(session_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_session_id, 
         lead(action) ignore nulls 
           over (order by ts, user_id, session_id, action) next_action, 
         lead(ts) ignore nulls 
           over (order by ts, user_id, session_id, action) next_ts
  from   abc 
  order by ts, user_id, session_id, action)
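-- keep a row when it is the last row overall, or when the next row
-- differs in user, session, or action (i.e. the current run ends here)
where next_ts is null
   or user_id <> next_user_id
   or session_id <> next_session_id
   or action <> next_action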
order by ts, user_id, session_id, action;

This worked fine; I was able to get the same four records you wanted.
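
Since the table lives in Snowflake, the LEAD comparison could also be written with QUALIFY, which filters on a window function without an inline view. A minimal sketch, assuming the same abc table and ts column as above, and assuming runs should be evaluated per user_id and session_id (the PARTITION BY and the action tie-breaker are additions, not part of the query above):

```sql
-- Sketch only: keeps a row when the next row in the same session has a
-- different action, or when there is no next row at all.
SELECT user_id, session_id, action, ts
FROM abc
QUALIFY action IS DISTINCT FROM LEAD(action) OVER (
            PARTITION BY user_id, session_id
            ORDER BY ts, action
        )
ORDER BY user_id, session_id, ts, action;
```

IS DISTINCT FROM treats the NULL returned by LEAD on the last row of a session as "different", so that row is kept without needing a sentinel value such as 'x'.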

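I hope this helps... Rich

p.s. If this (or another) answer helps you, please take a moment to "accept" the answer that helped by clicking the check mark next to it to toggle it from "greyed out" to "filled in".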

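This is called a gaps-and-islands problem. It is typically solved by creating a group key from the difference between two row numberings computed side by side.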

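select
  user_id, session_id, action, max(timestamp) as last_timestamp
from
(
  select
    user_id, session_id, action, timestamp,
    -- position in the full timeline minus position within the same
    -- (user_id, session_id, action); the difference is constant within a run
    row_number() over (order by timestamp, user_id, session_id, action) -
    row_number() over (partition by user_id, session_id, action order by timestamp)
      as grp
  from mytable
)
group by grp, user_id, session_id, action
-- grp is not guaranteed to increase over time across different actions,
-- so sort by each run's last timestamp instead
order by last_timestamp, user_id, session_id, action;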
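
To see why the difference works as a group key, it may help to look at the two row numbers side by side. The query below is only an illustration on the question's sample data; rn_all and rn_action are names made up here, and the values in the comment were worked out by hand from the seven sample rows.

```sql
-- Illustration only: exposes the two row numbers and their difference.
select
  action, timestamp,
  row_number() over (order by timestamp, user_id, session_id, action) as rn_all,
  row_number() over (partition by user_id, session_id, action order by timestamp) as rn_action,
  row_number() over (order by timestamp, user_id, session_id, action) -
  row_number() over (partition by user_id, session_id, action order by timestamp) as grp
from mytable
order by rn_all;

-- action  timestamp          rn_all  rn_action  grp
-- scroll  21-08-01 12:00:01  1       1          0
-- scroll  21-08-01 12:00:02  2       2          0
-- scroll  21-08-01 12:00:03  3       3          0
-- click   21-08-01 12:00:04  4       1          3
-- click   21-08-01 12:00:06  5       2          3
-- saved   21-08-01 12:00:10  6       1          5
-- scroll  21-08-01 12:00:10  7       4          3
```

Within a run both counters advance together, so grp stays constant. Note that the click run and the final scroll row both get grp = 3, which is why action (along with user_id and session_id) has to be part of the GROUP BY next to grp.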