SQL time difference between two rows in a partition window
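I have a table of analytics events, and I am trying to calculate the time difference between two rows: how long a user spent between attempting to start and actually starting.
My data looks like this: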
# | session | type | recordedAt |
---|---|---|---|
1 | D4E77C | feedbackProvided | 2021-08-17T09:13:00.768+03:00 |
2 | D4E77C | feedbackProvided | 2021-08-17T12:06:03.301+03:00 |
3 | D4E77C | feedbackProvided | 2021-08-17T14:28:15.083+03:00 |
4 | D4E77C | feedbackProvided | 2021-08-17T14:28:17.12+03:00 |
5 | D4E77C | buttonClicked | 2021-08-17T14:28:18.383+03:00 |
6 | D4E77C | measurementStarted | 2021-08-17T14:28:22.437+03:00 |
7 | D4E77C | buttonClicked | 2021-08-17T14:28:23.572+03:00 |
8 | D4E77C | measurementCancelled | 2021-08-17T14:28:23.573+03:00 |
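These are just the rows for one given session; assume there are many sessions.
I am trying to calculate the difference in recordedAt between the first feedbackProvided and the first measurementStarted. However, I only want to consider the first feedbackProvided that falls within 3 minutes of measurementStarted. In this case we would look at the difference between rows 1 and 6, but that time is > 3 minutes. Rows 2 and 6: also > 3 minutes. Rows 3 and 6: about 7 seconds.
This is my first time working with partitions. I am close, but I cannot work out how to apply the 3-minute maximum time difference.
Am I on the right track?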
WITH firstFeedbackProvided AS (
    SELECT
        session, type, recordedAt,
        ROW_NUMBER() over(partition by session order by recordedAt) rn
    FROM events
    WHERE type='feedbackProvided'
),
firstMeasurementStarted AS (
    SELECT
        session, type, recordedAt,
        ROW_NUMBER() over(partition by session order by recordedAt) rn
    FROM events
    WHERE type='measurementStarted'
)
SELECT
    *,
    date_diff('millisecond', t1.recordedAt, t2.recordedAt) as diff
FROM firstFeedbackProvided as t1
JOIN firstMeasurementStarted as t2 ON t1.session = t2.session
WHERE t1.rn = 1
AND t2.rn = 1
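I would suggest treating this as a gaps-and-islands problem: filter out everything that is not measurementStarted or feedbackProvided, create groups based on whether the previous row is measurementStarted, find the maximum time in each group (which should be that of the measurementStarted row), and use it to filter out the feedbackProvided records in the group that fall outside the 3-minute window.
Data: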
WITH dataset AS (
SELECT *
FROM
(
VALUES
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T09:13:00.768+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T12:06:03.301+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T14:28:15.083+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T14:28:17.12+03:00')),
('D4E77C', 'buttonClicked', from_iso8601_timestamp('2021-08-17T14:28:18.383+03:00')),
('D4E77C', 'measurementStarted', from_iso8601_timestamp('2021-08-17T14:28:22.437+03:00')),
('D4E77C', 'buttonClicked', from_iso8601_timestamp('2021-08-17T14:28:23.572+03:00')),
('D4E77C', 'measurementCancelled', from_iso8601_timestamp('2021-08-17T14:28:23.573+03:00')),
('D4E77C1', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T09:13:00.768+03:00')),
('D4E77C1', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T12:06:03.301+03:00')),
('D4E77C1', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T14:28:15.083+03:00')),
('D4E77C1', 'feedbackProvided', from_iso8601_timestamp('2021-08-17T14:28:17.12+03:00')),
('D4E77C1', 'buttonClicked', from_iso8601_timestamp('2021-08-17T14:28:18.383+03:00')),
('D4E77C1', 'measurementStarted', from_iso8601_timestamp('2021-08-17T14:28:22.437+03:00')),
('D4E77C1', 'buttonClicked', from_iso8601_timestamp('2021-08-17T14:28:23.572+03:00')),
('D4E77C1', 'measurementCancelled', from_iso8601_timestamp('2021-08-17T14:28:23.573+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-18T09:13:00.768+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-18T12:06:03.301+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-18T14:28:15.083+03:00')),
('D4E77C', 'feedbackProvided', from_iso8601_timestamp('2021-08-18T14:28:17.12+03:00')),
('D4E77C', 'buttonClicked', from_iso8601_timestamp('2021-08-18T14:28:18.383+03:00')),
('D4E77C', 'measurementStarted', from_iso8601_timestamp('2021-08-18T14:28:22.437+03:00')),
('D4E77C', 'buttonClicked', from_iso8601_timestamp('2021-08-18T14:28:23.572+03:00')),
('D4E77C', 'measurementCancelled', from_iso8601_timestamp('2021-08-18T14:28:23.573+03:00'))
) AS t (session, type, recordedAt)
)
select session, max(recordedAt) - min(recordedAt)
from (
    -- latest time in each group; expected to be the measurementStarted row
    select *, max(recordedAt) over (partition by session, grp) as m_started_date
    from (
        -- a new group starts on the row that follows a measurementStarted
        select *,
               sum(case when prev_type = 'measurementStarted' then 1 else 0 end)
                   over (partition by session order by recordedAt) as grp
        from (
            select session,
                   type,
                   recordedAt,
                   lag(type) over (partition by session order by recordedAt) as prev_type
            from dataset
            where type in ('measurementStarted', 'feedbackProvided')
        )
    )
)
-- keep only rows less than 3 minutes before the group's latest time
where m_started_date - recordedAt < interval '3' minute
group by session, grp
Output:
session | _col1 |
---|---|
D4E77C1 | 0 00:00:07.354 |
D4E77C | 0 00:00:07.354 |
D4E77C | 0 00:00:07.354 |
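To make the grouping step easier to follow, here is a minimal sketch that exposes the intermediate prev_type and grp columns. It is an adaptation, not the query above: it runs on SQLite through Python's standard sqlite3 module (window functions need SQLite 3.25 or newer), uses julianday() for ordering in place of the Presto/Trino-style timestamp functions, and works on a shortened, hypothetical subset of the sample rows.

```python
import sqlite3

# In-memory table holding a shortened subset of the sample rows (two measurement cycles).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session TEXT, type TEXT, recordedAt TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("D4E77C", "feedbackProvided", "2021-08-17T14:28:15.083+03:00"),
        ("D4E77C", "feedbackProvided", "2021-08-17T14:28:17.12+03:00"),
        ("D4E77C", "buttonClicked", "2021-08-17T14:28:18.383+03:00"),
        ("D4E77C", "measurementStarted", "2021-08-17T14:28:22.437+03:00"),
        ("D4E77C", "feedbackProvided", "2021-08-18T14:28:15.083+03:00"),
        ("D4E77C", "measurementStarted", "2021-08-18T14:28:22.437+03:00"),
    ],
)

# prev_type is the type of the previous relevant row in the session;
# grp is a running count of rows whose predecessor was measurementStarted,
# so it increases by one right after each measurementStarted.
grouped = conn.execute(
    """
    SELECT type, prev_type,
           SUM(CASE WHEN prev_type = 'measurementStarted' THEN 1 ELSE 0 END)
               OVER (PARTITION BY session ORDER BY ts) AS grp
    FROM (
        SELECT session, type, julianday(recordedAt) AS ts,
               LAG(type) OVER (PARTITION BY session
                               ORDER BY julianday(recordedAt)) AS prev_type
        FROM events
        WHERE type IN ('measurementStarted', 'feedbackProvided')
    )
    ORDER BY ts
    """
).fetchall()

for row in grouped:
    print(row)
```

Each measurementStarted row closes its group, and the first row after it opens the next one, which is why max(recordedAt) per group is expected to be the measurementStarted time.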
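I think you are overcomplicating the problem. Do the following:
- Calculate the time of the first measurementStarted in each session.
- Filter the rows to keep only the feedbackProvided events that precede it within your time window.
- Aggregate.
In SQL, this looks like: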
select session,
       first_measurementStarted - min(recordedat)
from (select e.*,
             min(case when type = 'measurementStarted' then recordedat end) over (partition by session) as first_measurementStarted
      from events e
     ) e
where type = 'feedbackProvided' and
      -- keep only feedback in the 3 minutes before the first measurementStarted
      recordedat > first_measurementStarted - interval '3' minute and
      recordedat < first_measurementStarted
group by session, first_measurementStarted;
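The three steps can also be tried locally without a Presto/Trino-style engine. The sketch below is an adaptation under stated assumptions, not the query above verbatim: it runs on SQLite through Python's sqlite3 module (SQLite 3.25 or newer for window functions), replaces interval arithmetic with julianday() values measured in days, adds an explicit upper bound so only feedback before the measurement counts, and reports the difference in milliseconds as diff_ms (a name introduced here for illustration).

```python
import sqlite3

# The eight sample rows for session D4E77C from the question.
rows = [
    ("D4E77C", "feedbackProvided", "2021-08-17T09:13:00.768+03:00"),
    ("D4E77C", "feedbackProvided", "2021-08-17T12:06:03.301+03:00"),
    ("D4E77C", "feedbackProvided", "2021-08-17T14:28:15.083+03:00"),
    ("D4E77C", "feedbackProvided", "2021-08-17T14:28:17.12+03:00"),
    ("D4E77C", "buttonClicked", "2021-08-17T14:28:18.383+03:00"),
    ("D4E77C", "measurementStarted", "2021-08-17T14:28:22.437+03:00"),
    ("D4E77C", "buttonClicked", "2021-08-17T14:28:23.572+03:00"),
    ("D4E77C", "measurementCancelled", "2021-08-17T14:28:23.573+03:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session TEXT, type TEXT, recordedAt TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Step 1: first measurementStarted per session (window MIN over the session).
# Step 2: keep feedbackProvided rows in the 3 minutes before it (3 / 1440 of a day).
# Step 3: aggregate, taking the earliest remaining feedback per session.
result = conn.execute(
    """
    SELECT session,
           CAST(ROUND((first_started - MIN(ts)) * 86400000) AS INTEGER) AS diff_ms
    FROM (
        SELECT session, type, julianday(recordedAt) AS ts,
               MIN(CASE WHEN type = 'measurementStarted'
                        THEN julianday(recordedAt) END)
                   OVER (PARTITION BY session) AS first_started
        FROM events
    ) e
    WHERE type = 'feedbackProvided'
      AND ts > first_started - 3.0 / 1440
      AND ts < first_started
    GROUP BY session, first_started
    """
).fetchall()

print(result)  # [('D4E77C', 7354)]
```

The 7354 ms result matches the roughly 7 seconds between rows 3 and 6 described in the question.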