Missing rows in full outer join

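I am trying to count the users observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (the unique ID) and d0 (or d1 or d2), which is 1 to indicate that the user was observed on that day.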

The following query:

select d0,d1,d2, count(*) as user_count from (
select uid, 1 as d0
from my_table
where day=5 and uid is not Null
group by uid
) as t0 full outer join (
select uid, 1 as d1
from my_table
where day=6 and uid is not Null
group by uid
) as t1 on t0.uid = t1.uid
full outer join (
select uid, 1 as d2
from my_table
where day=7 and uid is not Null
group by uid
) as t2 on t0.uid = t2.uid and t1.uid = t2.uid
group by d0,d1,d2 order by d0,d1,d2

spark.sql(q).toPandas().set_index(["d0","d1","d2"]) produces this output:

          user_count
d0 d1 d2            
0  0  1        73455
   1  0        53345
1  0  0        49254
   1  0         8234
      1        78455

Two rows are obviously missing: 0 1 1 and 1 0 1. Why?!

PS1. I understand why 0 0 0 is missing.

PS2. my_table looks roughly like this:

create table my_table (uid integer, day integer);
insert into my_table values
 (1, 5), (1, 6), (1, 7),
 (2, 5), (2, 6),
 (3, 5), (3, 7),
 (4, 6), (4, 7),
 (5, 5),
 (6, 6),
 (7, 7);

For this table I expect the query to return

          user_count
d0 d1 d2            
0  0  1        1      --- uid = 7
   1  0        1      --- uid = 6
      1        1      --- uid = 4
1  0  0        1      --- uid = 5
      1        1      --- uid = 3
   1  0        1      --- uid = 2
      1        1      --- uid = 1

Use two levels of aggregation instead of the full join:

select d0, d1, d2, count(*)
from (select uid,
             max(case when day = 5 then 1 else 0 end) as d0,
             max(case when day = 6 then 1 else 0 end) as d1,
             max(case when day = 7 then 1 else 0 end) as d2
      from my_table
      where uid is not Null
      group by uid
     ) u
group by d0, d1, d2;
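
One way to sanity-check this query against the sample data from the question is to run it in an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for Spark here, and the names ROWS, QUERY and count_patterns are made up for the illustration. An order by is added so the output is deterministic.

```python
import sqlite3

# Sample rows from the question: (uid, day)
ROWS = [
    (1, 5), (1, 6), (1, 7),
    (2, 5), (2, 6),
    (3, 5), (3, 7),
    (4, 6), (4, 7),
    (5, 5),
    (6, 6),
    (7, 7),
]

# Inner level: one row per uid with a 0/1 flag per day.
# Outer level: count the users sharing each flag pattern.
QUERY = """
select d0, d1, d2, count(*) as user_count
from (select uid,
             max(case when day = 5 then 1 else 0 end) as d0,
             max(case when day = 6 then 1 else 0 end) as d1,
             max(case when day = 7 then 1 else 0 end) as d2
      from my_table
      where uid is not null
      group by uid
     ) u
group by d0, d1, d2
order by d0, d1, d2
"""


def count_patterns(rows):
    """Return {(d0, d1, d2): user_count} for the given (uid, day) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table my_table (uid integer, day integer)")
        conn.executemany("insert into my_table values (?, ?)", rows)
        return {(d0, d1, d2): n for d0, d1, d2, n in conn.execute(QUERY)}
    finally:
        conn.close()


if __name__ == "__main__":
    for pattern, n in count_patterns(ROWS).items():
        print(pattern, n)
```

On the sample data, each of the seven non-empty patterns, including 0 1 1 and 1 0 1, should come back with a count of 1. The pattern 0 0 0 cannot appear, because every uid in the table was seen on at least one of the three days.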

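As for the original query: the last FULL JOIN has to account for the fact that t0.uid may be NULL after the first FULL JOIN, so the condition must use OR, not AND. With AND, a t2 row matches only when both t0.uid and t1.uid equal t2.uid, so a user seen on days 5 and 7 but not on day 6 is split into two unmatched rows instead of producing 1 0 1.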

select d0,d1,d2, count(*) as user_count 
from (
   select uid, 1 as d0
   from my_table
   where day=5 and uid is not Null
   group by uid
) as t0 
full outer join (
   select uid, 1 as d1
   from my_table
   where day=6 and uid is not Null
   group by uid
) as t1 on t0.uid = t1.uid
full outer join (
   select uid, 1 as d2
   from my_table
   where day=7 and uid is not Null
   group by uid
) as t2 on t0.uid = t2.uid or t1.uid = t2.uid
group by d0,d1,d2 
order by d0,d1,d2;

SQL Server db<>fiddle
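
To see the difference between AND and OR without a database, the two joins can be imitated in plain Python. This is a rough sketch, not how Spark executes the query: full_outer_join is a hypothetical nested-loop helper, the uid columns are renamed uid0/uid1/uid2 to keep them apart, and None plays the role of NULL.

```python
from collections import Counter

# Sample rows from the question: (uid, day)
ROWS = [
    (1, 5), (1, 6), (1, 7), (2, 5), (2, 6), (3, 5), (3, 7),
    (4, 6), (4, 7), (5, 5), (6, 6), (7, 7),
]


def sql_eq(a, b):
    """SQL-style equality: a comparison involving NULL (None) is never true."""
    return a is not None and b is not None and a == b


def full_outer_join(left, right, on):
    """Nested-loop FULL OUTER JOIN over lists of dicts.

    Unmatched rows on either side are kept and padded with None.
    """
    left_cols = sorted({k for row in left for k in row})
    right_cols = sorted({k for row in right for k in row})
    out, matched_right = [], set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if on(l, r):
                out.append({**l, **r})
                matched_right.add(i)
                hit = True
        if not hit:
            out.append({**l, **dict.fromkeys(right_cols)})
    for i, r in enumerate(right):
        if i not in matched_right:
            out.append({**dict.fromkeys(left_cols), **r})
    return out


def day_table(rows, day, suffix):
    """Mimic one subquery: distinct uids seen on `day`, with a flag set to 1."""
    uids = sorted({uid for uid, d in rows if d == day and uid is not None})
    return [{f"uid{suffix}": u, f"d{suffix}": 1} for u in uids]


def count_patterns(rows, last_join_on):
    """Join t0, t1, t2 and count rows per (d0, d1, d2) pattern."""
    t0, t1, t2 = (day_table(rows, day, i) for i, day in enumerate((5, 6, 7)))
    t01 = full_outer_join(t0, t1, lambda l, r: sql_eq(l["uid0"], r["uid1"]))
    joined = full_outer_join(t01, t2, last_join_on)
    return Counter((row["d0"], row["d1"], row["d2"]) for row in joined)


def and_condition(l, r):
    # Condition from the question: both earlier uids must equal t2.uid.
    return sql_eq(l["uid0"], r["uid2"]) and sql_eq(l["uid1"], r["uid2"])


def or_condition(l, r):
    # Corrected condition: either earlier uid may equal t2.uid.
    return sql_eq(l["uid0"], r["uid2"]) or sql_eq(l["uid1"], r["uid2"])


if __name__ == "__main__":
    print("AND:", dict(count_patterns(ROWS, and_condition)))
    print("OR: ", dict(count_patterns(ROWS, or_condition)))
```

With the AND condition, uids 3 and 4 never match a t2 row, so they are counted once on the left side and once more as unmatched t2 rows. That gives five patterns, the same shape as the output in the question. With OR, all seven patterns appear once each.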

Personally, I would stick with Gordon Linoff's solution.