How to find combinations of intersections across many tables?
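I have a list of different channels that could bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users that come from each combination of these channels. Each channel has its own table and tracks its own users.
The tables look like the following: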
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
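I want to know the following combinations:
- Visited channel A only
- Visited channels A and B only
- Visited channels B and C only
- Visited channel B only
- etc.
However, when there are a lot of channels (I have around 8) there are a lot of combinations. What I have done so far is roughly as simple as this (this one covers the combinations that include channel A):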
SELECT
a.date,
COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
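But this gets extremely tedious when the total number of channels is 8 (28 variations for combinations of 2, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking of using FULL OUTER JOIN but can't seem to get the hang of it. Answers are really appreciated.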
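I think you could use set operators to answer your question: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
For example:
- visiting only channel A is (A EXCEPT B) EXCEPT C
- A INTERSECT B
and so on.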
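To make the set-operator idea concrete, here is a minimal JavaScript sketch that uses plain `Set` objects in place of the tables, filled with the sample data from the question for 2020-08-01. The helpers `except` and `intersect` are made-up names for illustration and only mimic what SQL's EXCEPT DISTINCT and INTERSECT DISTINCT would do; they are not BigQuery functions.

```javascript
// Sample user_ids per channel for 2020-08-01 (taken from the question).
const channelA = new Set(['A', 'B', 'C']);
const channelB = new Set(['C', 'D', 'G']);
const channelC = new Set(['A', 'C', 'F']);

// Hypothetical helpers mirroring SQL's EXCEPT DISTINCT / INTERSECT DISTINCT.
const except = (x, y) => new Set([...x].filter(v => !y.has(v)));
const intersect = (x, y) => new Set([...x].filter(v => y.has(v)));

// "Only channel A" = (A EXCEPT B) EXCEPT C
const onlyA = except(except(channelA, channelB), channelC);

// "Only channels A and C" = (A INTERSECT C) EXCEPT B
const onlyAC = except(intersect(channelA, channelC), channelB);

console.log([...onlyA]);  // [ 'B' ]
console.log([...onlyAC]); // [ 'A' ]
```

Note that every combination still needs its own expression, so on its own this approach does not reduce the number of cases you have to write out.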
I am thinking of a full join and aggregation:
select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c
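As a rough illustration of what the full join produces, the following JavaScript sketch builds one entry per (date, user_id) with a presence flag per channel and then counts entries per flag pattern. It runs on the sample data from the question; the variable names (`tables`, `presence`, `counts`) and the `-` placeholder for NULL are assumptions made for this sketch only.

```javascript
// Rows as [date, user_id] pairs per channel (sample data from the question).
const tables = {
  a: [['2020-08-01', 'A'], ['2020-08-01', 'B'], ['2020-08-01', 'C']],
  b: [['2020-08-01', 'C'], ['2020-08-01', 'D'], ['2020-08-01', 'G']],
  c: [['2020-08-01', 'A'], ['2020-08-01', 'C'], ['2020-08-01', 'F']],
};
const names = Object.keys(tables);

// Full-join equivalent: one entry per (date, user_id) holding a flag per channel.
// A null flag plays the role of the NULL column left behind by the outer join.
const presence = new Map();
names.forEach((name, i) => {
  for (const [date, userId] of tables[name]) {
    const key = `${date}|${userId}`;
    const flags = presence.get(key) || names.map(() => null);
    flags[i] = name;
    presence.set(key, flags);
  }
});

// GROUP BY date and the three flag columns, then count.
const counts = {};
for (const [key, flags] of presence) {
  const date = key.split('|')[0];
  const group = `${date} ${flags.map(f => f || '-').join(' ')}`;
  counts[group] = (counts[group] || 0) + 1;
}

console.log(counts['2020-08-01 - b -']); // 2 (users D and G visited only channel b)
console.log(counts['2020-08-01 a b c']); // 1 (user C visited all three)
```

Each distinct flag pattern corresponds to one "visited only these channels" combination, so nothing has to be enumerated by hand.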
I would approach this with union all and two levels of aggregation:
select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
from ((select distinct date, user_id, 'a' as channel from a) union all
(select distinct date, user_id, 'b' as channel from b) union all
(select distinct date, user_id, 'c' as channel from c)
) abc
group by date, user_id
) c
group by date, channels;
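The two aggregation levels can be sketched outside SQL as well. The JavaScript below is only an illustration on the sample data from the question: the first pass collects the distinct channels per (date, user_id), the second counts users per sorted channel list. The names `unioned`, `perUser` and `numUsers` are invented for this sketch.

```javascript
// Rows after the UNION ALL step: [date, user_id, channel] (sample data from the question).
const unioned = [
  ['2020-08-01', 'A', 'a'], ['2020-08-01', 'B', 'a'], ['2020-08-01', 'C', 'a'],
  ['2020-08-01', 'C', 'b'], ['2020-08-01', 'D', 'b'], ['2020-08-01', 'G', 'b'],
  ['2020-08-01', 'A', 'c'], ['2020-08-01', 'C', 'c'], ['2020-08-01', 'F', 'c'],
];

// First aggregation: distinct channels per (date, user_id).
const perUser = new Map();
for (const [date, userId, channel] of unioned) {
  const key = `${date}|${userId}`;
  if (!perUser.has(key)) perUser.set(key, new Set());
  perUser.get(key).add(channel);
}

// Second aggregation: number of users per (date, sorted channel list).
// The comma separator matches string_agg's default delimiter.
const numUsers = {};
for (const [key, channels] of perUser) {
  const date = key.split('|')[0];
  const label = `${date} ${[...channels].sort().join(',')}`;
  numUsers[label] = (numUsers[label] || 0) + 1;
}

console.log(numUsers['2020-08-01 b']);   // 2
console.log(numUsers['2020-08-01 a,c']); // 1
```

Only combinations that actually occur in the data show up, which is usually what you want for a DAU report.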
However, when there are a lot of channels (I have around 8 channels) the combination is a lot
extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this?
The below is for BigQuery Standard SQL and addresses exactly the aspect of the OP's concern quoted above
#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
var fn = function(n, src, got, all) {
if (n == 0) {
if (got.length > 0) {
all[all.length] = got;
} return;
}
for (var j = 0; j < src.length; j++) {
fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
} return;
}
var all = []; for (var i = 1; i < a.length; i++) {
fn(i, a, [], all);
}
all.push(a);
return all;
}
return combine(a)
''';
with users as (
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
), visits as (
select date, user_id,
string_agg(channel, ' & ' order by channel) combination
from users
group by date, user_id
), channels AS (
select channel, cast(row_number() over(order by channel) as string) channel_num
from (select distinct channel from users)
), combinations as (
select string_agg(channel, ' & ' order by channel) combination
from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items,
unnest(split(items)) AS channel_num
join channels using(channel_num)
group by items
)
select date,
combination as channels_visited_only,
count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
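If applied to the sample data from your question, the output is
date channels_visited_only dau
2020-08-01 A 1
2020-08-01 A & B & C 1
2020-08-01 A & C 1
2020-08-01 B 2
2020-08-01 C 1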
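Some explanations to help with using the above
CTE users
simply unions all the tables and adds a channel column so that you can tell which table each row came from
CTE visits
extracts the list of all visited channels for each user-date combination
CTE channels
simply prepares the list of channels and assigns each one a number for later use
CTE combinations
uses the JS UDF to generate all combinations of channel numbers, then joins them back to channels to produce the channel combinations
The final SELECT statement simply looks for those users whose list of visited channels matches one of the channel combinations generated in the previous step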
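To see what the UDF hands to the combinations CTE, the same recursion can be run as a standalone script. This is a sketch, not the UDF itself: `generateCombinations` is a renamed, slightly tidied copy of the function body above, and joining each inner array with a comma is my assumption about how the arrays end up as the strings that `split(items)` later takes apart.

```javascript
// Same recursion as the UDF body, lifted into a standalone function.
function generateCombinations(a) {
  const all = [];
  // Pick n more items from src, extending the partial combination `got`.
  const pick = (n, src, got) => {
    if (n === 0) {
      if (got.length > 0) all.push(got);
      return;
    }
    for (let j = 0; j < src.length; j++) {
      pick(n - 1, src.slice(j + 1), got.concat([src[j]]));
    }
  };
  // Sizes 1..a.length; the last size yields the full array once.
  for (let size = 1; size <= a.length; size++) pick(size, a, []);
  return all;
}

// Assumption: each inner array reaches SQL as a comma-separated string.
const three = generateCombinations([1, 2, 3]).map(c => c.join(','));
console.log(three); // [ '1', '2', '3', '1,2', '1,3', '2,3', '1,2,3' ]

// With 8 channels, as in the question:
const eight = generateCombinations([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(eight.length);                             // 255
console.log(eight.filter(c => c.length === 2).length); // 28
console.log(eight.filter(c => c.length === 3).length); // 56
console.log(eight.filter(c => c.length === 4).length); // 70
```

The counts for sizes 2, 3 and 4 match the 28, 56 and 70 mentioned in the question, and the total of 255 is every non-empty subset of 8 channels.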
Some recommendations for further streamlining above code
- Assuming the names of your channel tables follow the channel_* pattern, you can use the wildcard tables feature in the users CTE, so instead of
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
you can use something like the following, so there is just one line instead of one line per table
select distinct date, user_id, _TABLE_SUFFIX as channel from `channel_*`