How to find combinations of intersections across many tables?

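I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users that come from each combination of these channels. Each channel has its own table and tracks its own users.

The tables look like this: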

channel A
date         user_id
2020-08-01   A
2020-08-01   B
2020-08-01   C

channel B
date         user_id
2020-08-01   C
2020-08-01   D
2020-08-01   G

channel C
date         user_id
2020-08-01   A
2020-08-01   C
2020-08-01   F

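I would like to know the following combinations:

  1. Users who visit only channel A
  2. Users who visit only channels A and B
  3. Users who visit only channels B and C
  4. Users who visit only channel B
  5. And so on

However, when there are a lot of channels (I have around 8), the number of combinations is large. What I have done so far is roughly the following (this one covers the combinations that include channel A):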

SELECT 
    a.date, 
    COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
    COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
    ...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date 
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1

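But this becomes extremely tedious when there are 8 channels in total (28 variations for combinations of 2, 56 for combinations of 3, 70 for combinations of 4, and many more).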
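For scale, those counts are the binomial coefficients C(8, k), and summing them over every k gives 2^8 - 1 = 255 non-empty channel combinations. A small JavaScript sketch reproduces the numbers; the helper name `choose` is made up for this illustration and is not part of any query here.

```javascript
// Number of ways to pick k channels out of n (binomial coefficient).
// `choose` is a hypothetical helper, used only for this illustration.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    // After step i the value equals C(n - k + i, i), so it stays an integer.
    result = (result * (n - k + i)) / i;
  }
  return result;
}

const channels = 8;
const perSize = [];
for (let k = 1; k <= channels; k++) {
  perSize.push(choose(channels, k));
}

// Total number of non-empty combinations; equals 2^8 - 1.
const total = perSize.reduce((sum, c) => sum + c, 0);
console.log(perSize, total);
```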

Are there any smart ideas to solve this? I am thinking of using FULL OUTER JOIN but can't seem to get the hang of it. Answers are greatly appreciated.

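I think you could use set operators to answer your question: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

For example:

  1. is (A EXCEPT DISTINCT B) EXCEPT DISTINCT C
  2. is (A INTERSECT DISTINCT B) EXCEPT DISTINCT C

and so on.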

I am thinking full join and aggregation:

select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from      (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c

I would approach this with union all and two levels of aggregation:

select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
      from ((select distinct date, user_id, 'a' as channel from a) union all
            (select distinct date, user_id, 'b' as channel from b) union all
            (select distinct date, user_id, 'c' as channel from c) 
           ) abc
      group by date, user_id
     ) c
group by date, channels;
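To see what the two aggregation levels produce, here is a small JavaScript sketch that mimics the query on the sample rows from the question. It is only an in-memory illustration, not BigQuery code; the names `rows`, `perUser` and `counts` are made up for the example.

```javascript
// Sample rows from the question, already tagged with a channel label
// (this plays the role of the UNION ALL subquery).
const rows = [
  ["2020-08-01", "A", "a"], ["2020-08-01", "B", "a"], ["2020-08-01", "C", "a"],
  ["2020-08-01", "C", "b"], ["2020-08-01", "D", "b"], ["2020-08-01", "G", "b"],
  ["2020-08-01", "A", "c"], ["2020-08-01", "C", "c"], ["2020-08-01", "F", "c"],
];

// Level 1: one entry per (date, user_id) holding the set of channels visited.
const perUser = new Map();
for (const [date, userId, channel] of rows) {
  const key = date + "|" + userId;
  if (!perUser.has(key)) perUser.set(key, { date, channels: new Set() });
  perUser.get(key).channels.add(channel);
}

// Level 2: count users per (date, sorted comma-separated channel list).
const counts = {};
for (const { date, channels } of perUser.values()) {
  const combo = [...channels].sort().join(",");
  const key = date + " " + combo;
  counts[key] = (counts[key] || 0) + 1;
}
console.log(counts);
```

On the sample data this yields one user each for a, a,c, a,b,c and c, and two users for b. Combinations that no user hit, such as a,b, simply do not appear, so nothing has to be enumerated up front.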
  

"However, when there are a lot of channels (I have around 8 channels) the combination is a lot ... extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more). Any smart ideas to solve this?"

The following is for BigQuery Standard SQL and addresses exactly the aspect of the OP's concern quoted above.

#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>) 
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
  var combine = function(a) {
    var fn = function(n, src, got, all) {
      if (n == 0) {
        if (got.length > 0) {
          all[all.length] = got;
        } return;
      }
      for (var j = 0; j < src.length; j++) {
        fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
      } return;
    }
    var all = []; for (var i = 1; i < a.length; i++) {
      fn(i, a, [], all);
    }
    all.push(a);
    return all;
  } 
  return combine(a)
''';
with users as (
    select distinct date, user_id, 'A' channel from channel_A union all
    select distinct date, user_id, 'B' from channel_B union all
    select distinct date, user_id, 'C' from channel_C 
), visits as (
  select date, user_id, 
    string_agg(channel, ' & ' order by channel) combination
  from users
  group by date, user_id
), channels AS (
  select channel, cast(row_number() over(order by channel) as string) channel_num
  from (select distinct channel from users)
), combinations as (
  select string_agg(channel, ' & ' order by channel_num) combination
  from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items, 
    unnest(split(items)) AS channel_num
  join channels using(channel_num)
  group by items
)
select date, 
  combination as channels_visited_only, 
  count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination

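If applied to the sample data from your question, the output is

date         channels_visited_only   dau
2020-08-01   A                       1
2020-08-01   A & B & C               1
2020-08-01   A & C                   1
2020-08-01   B                       2
2020-08-01   C                       1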

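Some explanations to help with using the above:

  • CTE users simply unions all the tables and adds a channel column so that you can tell which table each row came from

  • CTE visits extracts the list of all visited channels for each user-date combination

  • CTE channels simply prepares the list of channels and assigns a number to each for later use

  • CTE combinations uses the JS UDF to generate all combinations of channel numbers, then joins them back to channels to produce the channel combinations

  • The final SELECT statement simply looks for those users whose list of visited channels matches a channel combination generated in the previous step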
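As an illustration of the combinations step, the UDF body can be run as plain JavaScript outside BigQuery. The sketch below restates the `combine` function from the query and prints what it returns for three channel numbers. The final `map(String)` is an assumption about how each inner array ends up as an ARRAY<STRING> element; the query's `split(items)` call relies on the items arriving comma-separated.

```javascript
// Same algorithm as the UDF body in the query, lightly reformatted.
function combine(a) {
  // Collect every way to pick n more elements from src, appended to got.
  function fn(n, src, got, all) {
    if (n === 0) {
      if (got.length > 0) all.push(got);
      return;
    }
    for (let j = 0; j < src.length; j++) {
      fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
    }
  }
  const all = [];
  // Sizes 1 .. length-1 first, then the full set itself.
  for (let i = 1; i < a.length; i++) {
    fn(i, a, [], all);
  }
  all.push(a);
  return all;
}

const combos = combine([1, 2, 3]);
// Assumption: each inner array is stringified like String([1, 2]) === "1,2".
const asStrings = combos.map(String);
console.log(asStrings);
```

For three channels this gives seven items (three singles, three pairs and the full set), which the combinations CTE then maps back to channel names through the channels CTE.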

Some recommendations for further streamlining the above code:

  • Assuming the names of your channel tables follow the channel_* pattern, you can use the wildcard tables feature in the users CTE. Instead of

select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C 

you can use something like the following, so just one line instead of as many lines as you have tables:

-- wildcard table names must be wrapped in backticks; qualify with your dataset as needed
select distinct date, user_id, _TABLE_SUFFIX as channel from `channel_*`