Selecting Distinct Consecutive Values of Timeseries
I have a table in Snowflake/dbt from which I want to select the distinct consecutive entries across rows. For example, if I have:
user_id | session_id | action | timestamp |
---|---|---|---|
2 | 3 | scroll | 21-08-01 12:00:01 |
2 | 3 | scroll | 21-08-01 12:00:02 |
2 | 3 | scroll | 21-08-01 12:00:03 |
2 | 3 | click | 21-08-01 12:00:04 |
2 | 3 | click | 21-08-01 12:00:06 |
2 | 3 | scroll | 21-08-01 12:00:10 |
2 | 3 | saved | 21-08-01 12:00:10 |
I would like to end up with this:
user_id | session_id | action | timestamp |
---|---|---|---|
2 | 3 | scroll | 21-08-01 12:00:03 |
2 | 3 | click | 21-08-01 12:00:06 |
2 | 3 | scroll | 21-08-01 12:00:10 |
2 | 3 | saved | 21-08-01 12:00:10 |
I tried using row_number() followed by qualify, but that numbers all occurrences of an action in one sequence, even when they are not consecutive.
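For reference, the rule being asked for can be stated outside SQL: walk the rows in order and keep the last row of every run that shares the same user_id, session_id and action. Here is a minimal Python sketch of that rule. It is not part of the original question, the function name `last_of_each_run` is made up for illustration, and it assumes the rows are already in the intended order (the two rows at 12:00:10 tie on timestamp, so their order here is simply the order shown in the table).

```python
from itertools import groupby

# Sample rows from the question: (user_id, session_id, action, timestamp).
rows = [
    (2, 3, "scroll", "21-08-01 12:00:01"),
    (2, 3, "scroll", "21-08-01 12:00:02"),
    (2, 3, "scroll", "21-08-01 12:00:03"),
    (2, 3, "click", "21-08-01 12:00:04"),
    (2, 3, "click", "21-08-01 12:00:06"),
    (2, 3, "scroll", "21-08-01 12:00:10"),
    (2, 3, "saved", "21-08-01 12:00:10"),
]


def last_of_each_run(rows):
    """Keep the final row of every run of consecutive rows that share
    (user_id, session_id, action). Hypothetical helper for illustration."""
    result = []
    # groupby only merges adjacent items, which is exactly the
    # "consecutive" behaviour wanted here.
    for _key, run in groupby(rows, key=lambda r: r[:3]):
        result.append(list(run)[-1])
    return result


for row in last_of_each_run(rows):
    print(row)
```

On the sample data this prints the four rows of the desired table: scroll at 12:00:03, click at 12:00:06, scroll at 12:00:10 and saved at 12:00:10.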
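You may try the following, which groups consecutive occurrences of the same action and then selects the most recent row of each group, in the order in which the groups appear.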
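    SELECT
        user_id,
        session_id,
        action,
        timestamp
    FROM (
        SELECT
            *,
            -- Rank the rows inside each group, newest first
            ROW_NUMBER() OVER (
                PARTITION BY user_id, session_id, action, gn
                ORDER BY timestamp DESC
            ) AS rn
        FROM (
            SELECT
                *,
                -- Running total of the change flags gives a group number
                SUM(continued) OVER (ORDER BY timestamp) AS gn
            FROM (
                SELECT
                    *,
                    -- 0 when this row repeats the previous row's
                    -- user/session/action, 1 when it differs (the LAG
                    -- default makes the first row count as a repeat)
                    CASE
                        WHEN LAG(
                                CONCAT(user_id, session_id, action),
                                1,
                                CONCAT(user_id, session_id, action)
                            ) OVER (
                                ORDER BY timestamp
                            ) = CONCAT(user_id, session_id, action) THEN 0
                        ELSE 1
                    END AS continued
                FROM
                    my_table
            ) t2
        ) t1
    ) t
    -- Keep only the newest row of each group
    WHERE rn = 1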
Let me know if this works for you.
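To see why the flag plus running total produces one number per run, here is a small Python sketch of just that step. It is an illustration under assumptions, not a translation of the query: it looks at the action column only, assumes the rows are already sorted, ignores the timestamp tie at 12:00:10, and `group_numbers` is a hypothetical helper name.

```python
from itertools import accumulate

# Actions from the question, assumed to be already sorted by timestamp.
actions = ["scroll", "scroll", "scroll", "click", "click", "scroll", "saved"]


def group_numbers(values):
    """Return a run number for every element of values (hypothetical helper)."""
    if not values:
        return []
    # Flag is 1 where the value differs from the previous one. The first
    # element gets 0, like the LAG default in the query above.
    flags = [0] + [int(cur != prev) for prev, cur in zip(values, values[1:])]
    # The running total of the flags increases by one at every change.
    return list(accumulate(flags))


gn = group_numbers(actions)
print(gn)  # [0, 0, 0, 1, 1, 2, 3]

# Index of the last row in each run (later rows overwrite earlier ones).
last_index = {g: i for i, g in enumerate(gn)}
print(sorted(last_index.values()))  # [2, 4, 5, 6]
```

The indexes 2, 4, 5 and 6 are the rows the `rn = 1` filter is meant to keep.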
I tried something slightly different from ggordon's approach: build an inline view that carries the contents of the "next" record (using the LEAD function).
    -- Here the table is named abc and the timestamp column is ts
    select user_id, session_id, action, ts
    from (
        -- Inline view: attach the "next" record's values to every row
        select abc.*,
               lead(user_id) ignore nulls
                   over (order by ts, user_id, session_id, action) next_user_id,
               lead(session_id) ignore nulls
                   over (order by ts, user_id, session_id, action) next_session_id,
               lead(action) ignore nulls
                   over (order by ts, user_id, session_id, action) next_action,
               lead(ts) ignore nulls
                   over (order by ts, user_id, session_id, action) next_ts
        from abc
        order by ts, user_id, session_id, action)
    -- Keep a row when the next record has the same user and session but a
    -- different action; the NVL defaults let the very last row through
    -- (assuming no action is literally 'x')
    where user_id = NVL(next_user_id, user_id)
      and session_id = NVL(next_session_id, session_id)
      and action <> NVL(next_action, 'x')
    order by ts, user_id, session_id, action;
This worked, and I was able to get the same four records you wanted.
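The "compare with the next record" idea can be sketched in Python as follows. This is only an illustration under assumptions: the two rows for user 7 are hypothetical and were added to show interleaved sessions, `keep_where_next_differs` is a made-up name, and the sketch partitions by (user_id, session_id) before looking at the next row, which is a variation on the query above rather than a copy of it.

```python
from collections import defaultdict

# (user_id, session_id, action, timestamp), already sorted by timestamp.
rows = [
    (2, 3, "scroll", "21-08-01 12:00:01"),
    (2, 3, "scroll", "21-08-01 12:00:02"),
    (2, 3, "scroll", "21-08-01 12:00:03"),
    (2, 3, "click", "21-08-01 12:00:04"),
    (7, 1, "click", "21-08-01 12:00:05"),  # hypothetical row, not in the question
    (2, 3, "click", "21-08-01 12:00:06"),
    (7, 1, "click", "21-08-01 12:00:07"),  # hypothetical row, not in the question
    (2, 3, "scroll", "21-08-01 12:00:10"),
    (2, 3, "saved", "21-08-01 12:00:10"),
]


def keep_where_next_differs(rows):
    """Keep each row whose next row in the same (user_id, session_id)
    partition has a different action, or that has no next row."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row[0], row[1])].append(row)

    kept = []
    for part in partitions.values():
        # Pair every row with its successor; None marks "no next row".
        following = part[1:] + [None]
        for row, nxt in zip(part, following):
            if nxt is None or nxt[2] != row[2]:
                kept.append(row)
    return kept


for row in keep_where_next_differs(rows):
    print(row)
```

If I read the query above correctly, its LEAD windows are not partitioned, so with interleaved sessions the "next" record may belong to another user. Partitioning as in this sketch is one way to avoid that.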
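Hope this helps... Rich
P.S. If this (or another) answer helped you, please take a moment to "accept" the helpful answer by clicking the check mark next to it, which toggles it from greyed out to filled in.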
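This is called a gaps-and-islands problem. It is commonly solved by computing two row numberings side by side and using their difference as a group key.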
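    select
        user_id, session_id, action, max(timestamp)
    from
    (
        select
            user_id, session_id, action, timestamp,
            -- overall row number minus per-action row number:
            -- the difference stays constant within a run of identical rows
            row_number() over (order by timestamp, user_id, session_id, action) -
            row_number() over (partition by user_id, session_id, action order by timestamp)
                as grp
        from mytable
    )
    -- grp alone is not unique across actions, so group by the action columns too
    group by grp, user_id, session_id, action
    order by grp, user_id, session_id, action;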
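The row-number difference can be checked by hand with a short Python sketch. It is a simplified illustration, not the query itself: it tracks the action column only, takes the rows in the order shown in the question (so scroll comes before saved at 12:00:10, whereas the query's ORDER BY would put saved first), and `island_keys` is a hypothetical helper name.

```python
from collections import defaultdict

# (action, timestamp) pairs from the question, in the order shown there.
rows = [
    ("scroll", "21-08-01 12:00:01"),
    ("scroll", "21-08-01 12:00:02"),
    ("scroll", "21-08-01 12:00:03"),
    ("click", "21-08-01 12:00:04"),
    ("click", "21-08-01 12:00:06"),
    ("scroll", "21-08-01 12:00:10"),
    ("saved", "21-08-01 12:00:10"),
]


def island_keys(actions):
    """Overall row number minus per-action row number, for every row."""
    seen = defaultdict(int)
    keys = []
    for overall_rn, action in enumerate(actions, start=1):
        seen[action] += 1  # row number within this action's partition
        keys.append(overall_rn - seen[action])
    return keys


grp = island_keys([action for action, _ in rows])
print(grp)  # [0, 0, 0, 3, 3, 2, 6]

# Equivalent of GROUP BY grp, action with MAX(timestamp).
latest = {}
for (action, ts), g in zip(rows, grp):
    key = (g, action)
    latest[key] = max(latest.get(key, ts), ts)

for (g, action), ts in latest.items():
    print(g, action, ts)
```

The key values themselves carry no meaning; what matters is that they are equal within a run. Different actions can land on the same number, which is why the action columns stay in the GROUP BY.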