Getting values from the first row, last row, and an aggregation in a MySQL window function
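For a marketing-related analysis, I need to provide data on the first and last touchpoints and on the total number of interactions with our website.
A simplified version of our interaction table looks like this: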
create table interaction (
id varchar(36) primary key,
session_id varchar(36) not null,
timestamp timestamp(3) not null,
utm_source varchar(255) null,
utm_medium varchar(255) null
)
Our current approach looks like this:
with interaction_ordered as (
select *,
row_number() over (partition by session_id order by timestamp asc) as row_num_asc,
row_number() over (partition by session_id order by timestamp desc) as row_num_desc
from interaction
)
select first_interaction.session_id as session_id,
first_interaction.timestamp as session_start,
timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration,
count(*) as interaction_count,
first_interaction.utm_source as first_touchpoint,
last_interaction.utm_source as last_touchpoint,
last_interaction.utm_medium as last_medium
from interaction_ordered as interaction
join interaction_ordered as first_interaction using (session_id)
join interaction_ordered as last_interaction using (session_id)
where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
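At the moment we observe a runtime that grows roughly linearly with the size of our data, which will soon become infeasible to compute.
Another idea is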
select session_id,
min(timestamp) as session_start,
timestampdiff(
SECOND,
min(timestamp),
max(timestamp)
) as session_duration,
count(*) as interaction_count,
first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
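But in our experiments we never saw the second query finish, so we are not 100% sure that it yields the same results.
We tried indexes on timestamp and on (session_id, timestamp), but according to EXPLAIN this did not change the query plan.
Is there a fast way to retrieve individual attributes from the first and last entry per session_id, together with the count per session_id?
Note that in our real use case there are more parameters we are interested in, such as utm_source and utm_medium.
Edit
Sample data: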
insert into interaction values
('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'),
('b', 'session_1', '2020-06-15T12:01:00.000', null, null),
('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'),
('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'),
('e', 'session_2', '2020-06-15T14:00:00.000', null, null),
('f', 'session_2', '2020-06-15T14:12:00.000', null, null),
('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'),
('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'),
('i', 'session_3', '2020-06-16T12:05:01.000', null, null),
('j', 'session_4', '2020-06-15T12:00:00.000', null, null),
('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');
Expected result:
session_id, session_start, session_duration, interaction_count, first_touchpoint, last_touchpoint, last_medium
session_1, 2020-06-15T12:00:00.000, 120, 4, search.com, ads.com, ads
session_2, 2020-06-15T14:00:00.000, 1500, 3, null, social.com, social
session_3, 2020-06-16T12:05:00.000, 1, 2, ads.com, null, null
session_4, 2020-06-15T12:00:00.000, 0, 1, null, null, null
session_5, 2020-06-15T12:00:00.000, 0, 1, search.com, search.com, search
I noticed that my second query does not produce the expected result: last_touchpoint and last_medium are filled with the first values.
I tried
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
and
last_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,
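For context on the two variants above, here is a small sketch of how the frame clause changes what last_value returns when no group by is involved. It uses Python's bundled SQLite (3.25 or newer) purely as a stand-in for MySQL 8, on the assumption that both apply the standard default frame (unbounded preceding up to the current row) once an order by is present. Only the session_1 rows from the sample data are loaded, and the column aliases default_frame and full_frame are made up for this illustration.

```python
import sqlite3

# SQLite stands in for MySQL 8 here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table interaction ("
    "id text primary key, session_id text, timestamp text, utm_source text)"
)
conn.executemany(
    "insert into interaction values (?, ?, ?, ?)",
    [
        ("a", "session_1", "2020-06-15T12:00:00.000", "search.com"),
        ("b", "session_1", "2020-06-15T12:01:00.000", None),
        ("c", "session_1", "2020-06-15T12:01:30.000", "social.com"),
        ("d", "session_1", "2020-06-15T12:02:00.250", "ads.com"),
    ],
)

rows = conn.execute("""
    select id,
           -- no frame clause: the frame ends at the current row
           last_value(utm_source) over (
               partition by session_id order by timestamp
           ) as default_frame,
           -- explicit frame: the whole partition is visible from every row
           last_value(utm_source) over (
               partition by session_id order by timestamp
               range between unbounded preceding and unbounded following
           ) as full_frame
    from interaction
    order by timestamp
""").fetchall()

for row in rows:
    print(row)
```

With the default frame, last_value simply echoes the current row's utm_source, while the explicit frame yields the value of the session's final row on every line. This only covers the ungrouped case; it does not reproduce the behaviour of the grouped query.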
WITH cte AS ( SELECT *,
FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` ASC) first_touchpoint,
FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_touchpoint,
FIRST_VALUE(utm_medium) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_medium
FROM interaction
)
SELECT session_id,
MIN(`timestamp`) session_start,
TIMESTAMPDIFF(SECOND, MIN(`timestamp`), MAX(`timestamp`)) session_duration,
COUNT(*) interaction_count,
ANY_VALUE( first_touchpoint ) first_touchpoint,
ANY_VALUE( last_touchpoint ) last_touchpoint,
ANY_VALUE( last_medium ) last_medium
FROM cte
GROUP BY session_id;
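As a rough check, the CTE query above can be run against the sample data from the question. The sketch below uses Python's bundled SQLite (3.25 or newer) instead of MySQL, so two substitutions are assumptions of this example rather than part of the original query: max() replaces ANY_VALUE() (safe here because each window value is constant within a session), and a difference of strftime('%s', ...) values replaces TIMESTAMPDIFF(SECOND, ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table interaction (
        id text primary key, session_id text not null, timestamp text not null,
        utm_source text, utm_medium text
    )
""")
conn.executemany("insert into interaction values (?, ?, ?, ?, ?)", [
    ("a", "session_1", "2020-06-15T12:00:00.000", "search.com", "search"),
    ("b", "session_1", "2020-06-15T12:01:00.000", None, None),
    ("c", "session_1", "2020-06-15T12:01:30.000", "social.com", "social"),
    ("d", "session_1", "2020-06-15T12:02:00.250", "ads.com", "ads"),
    ("e", "session_2", "2020-06-15T14:00:00.000", None, None),
    ("f", "session_2", "2020-06-15T14:12:00.000", None, None),
    ("g", "session_2", "2020-06-15T14:25:00.000", "social.com", "social"),
    ("h", "session_3", "2020-06-16T12:05:00.000", "ads.com", "ads"),
    ("i", "session_3", "2020-06-16T12:05:01.000", None, None),
    ("j", "session_4", "2020-06-15T12:00:00.000", None, None),
    ("k", "session_5", "2020-06-15T12:00:00.000", "search.com", "search"),
])

rows = conn.execute("""
    with cte as (
        select *,
               first_value(utm_source) over (partition by session_id order by timestamp asc)  as first_touchpoint,
               first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
               first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
        from interaction
    )
    select session_id,
           min(timestamp) as session_start,
           -- SQLite substitute for TIMESTAMPDIFF(SECOND, ...): whole epoch seconds
           strftime('%s', max(timestamp)) - strftime('%s', min(timestamp)) as session_duration,
           count(*) as interaction_count,
           -- max() stands in for ANY_VALUE(); the value is constant per session
           max(first_touchpoint) as first_touchpoint,
           max(last_touchpoint)  as last_touchpoint,
           max(last_medium)      as last_medium
    from cte
    group by session_id
    order by session_id
""").fetchall()

for row in rows:
    print(row)
```

On the sample data this gives one row per session that matches the expected result listed in the question. It says nothing about performance on a large table, since the window functions still scan every row before the grouping happens.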
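The only way to make the query scalable is to reduce the amount of data being processed with a where clause. If I assume that sessions never last longer than one day, then I can widen the computed time range by one day and use window functions. That results in something like this: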
select s.*
from (select i.*,
min(timestamp) over (partition by session_id) as session_start,
count(*) over (partition by session_id) as interaction_count,
first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction i
where timestamp between ? - interval 2 day and ? + interval 2 day
) s
where timestamp = session_start and
session_start between ? - interval 1 day and ? + interval 1 day;
Your use of first_value() should return an error: it violates the "full group by" rules that are the default in MySQL 8+. Unsurprisingly, syntactically incorrect code does not work.
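To see the filtered window query from this answer in action, here is a sketch against the sample data, again with Python's bundled SQLite (3.25 or newer) as a stand-in for MySQL. The assumptions are mine: the report timestamp t is hypothetical, all four ? placeholders are taken to be that same timestamp, and the interval arithmetic is done in Python (the named parameters scan_lo, scan_hi, keep_lo and keep_hi are invented for this example) because SQLite has no interval syntax.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table interaction (
        id text primary key, session_id text not null, timestamp text not null,
        utm_source text, utm_medium text
    )
""")
conn.executemany("insert into interaction values (?, ?, ?, ?, ?)", [
    ("a", "session_1", "2020-06-15T12:00:00.000", "search.com", "search"),
    ("b", "session_1", "2020-06-15T12:01:00.000", None, None),
    ("c", "session_1", "2020-06-15T12:01:30.000", "social.com", "social"),
    ("d", "session_1", "2020-06-15T12:02:00.250", "ads.com", "ads"),
    ("e", "session_2", "2020-06-15T14:00:00.000", None, None),
    ("f", "session_2", "2020-06-15T14:12:00.000", None, None),
    ("g", "session_2", "2020-06-15T14:25:00.000", "social.com", "social"),
    ("h", "session_3", "2020-06-16T12:05:00.000", "ads.com", "ads"),
    ("i", "session_3", "2020-06-16T12:05:01.000", None, None),
    ("j", "session_4", "2020-06-15T12:00:00.000", None, None),
    ("k", "session_5", "2020-06-15T12:00:00.000", "search.com", "search"),
])

def iso(dt):
    # Same text format as the stored timestamps, so string comparison sorts correctly.
    return dt.isoformat(timespec="milliseconds")

t = datetime(2020, 6, 15, 12)  # hypothetical report timestamp
params = {
    "scan_lo": iso(t - timedelta(days=2)), "scan_hi": iso(t + timedelta(days=2)),
    "keep_lo": iso(t - timedelta(days=1)), "keep_hi": iso(t + timedelta(days=1)),
}

rows = conn.execute("""
    select session_id, session_start, interaction_count,
           first_touchpoint, last_touchpoint, last_medium
    from (select i.*,
                 min(timestamp) over (partition by session_id) as session_start,
                 count(*) over (partition by session_id) as interaction_count,
                 first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
                 first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
                 first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
          from interaction i
          -- wider scan range, so sessions that straddle the report window stay complete
          where timestamp between :scan_lo and :scan_hi
         ) s
    -- keep only the first row of each session, then apply the report window
    where timestamp = session_start
      and session_start between :keep_lo and :keep_hi
    order by session_id
""", params).fetchall()

for row in rows:
    print(row)
```

With this choice of t, session_3 drops out because it starts five minutes after the one-day window closes, and the other four sessions come back with one row each. The inner where is what lets an index on timestamp limit the rows that the window functions have to process.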