Missing rows in full outer join
I am trying to count the number of users observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (a unique ID) and d0 (or d1 or d2, respectively), which is 1 to indicate that the user was observed on that day.
The following query:
select d0,d1,d2, count(*) as user_count from (
select uid, 1 as d0
from my_table
where day=5 and uid is not Null
group by uid
) as t0 full outer join (
select uid, 1 as d1
from my_table
where day=6 and uid is not Null
group by uid
) as t1 on t0.uid = t1.uid
full outer join (
select uid, 1 as d2
from my_table
where day=7 and uid is not Null
group by uid
) as t2 on t0.uid = t2.uid and t1.uid = t2.uid
group by d0,d1,d2 order by d0,d1,d2
run via spark.sql(q).toPandas().set_index(["d0","d1","d2"]) produces this output:
          user_count
d0 d1 d2
0  0  1        73455
   1  0        53345
1  0  0        49254
   1  0         8234
      1        78455
Two rows are clearly missing: 0 1 1 and 1 0 1. Why?!
PS1. I understand why 0 0 0 is missing.
PS2. my_table looks roughly like this:
create table my_table (uid integer, day integer);
insert into my_table values
(1, 5), (1, 6), (1, 7),
(2, 5), (2, 6),
(3, 5), (3, 7),
(4, 6), (4, 7),
(5, 5),
(6, 6),
(7, 7);
For this table I would expect the query to return:
          user_count
d0 d1 d2
0  0  1            1   --- uid = 7
   1  0            1   --- uid = 6
      1            1   --- uid = 4
1  0  0            1   --- uid = 5
      1            1   --- uid = 3
   1  0            1   --- uid = 2
      1            1   --- uid = 1
Use two levels of aggregation instead of the full join:
select d0, d1, d2, count(*)
from (select uid,
max(case when day = 5 then 1 else 0 end) as d0,
max(case when day = 6 then 1 else 0 end) as d1,
max(case when day = 7 then 1 else 0 end) as d2
from my_table
where uid is not Null
group by uid
) u
group by d0, d1, d2;
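Since the question already runs its SQL through spark.sql, the same two-level aggregation can also be expressed with the DataFrame API. A minimal sketch, assuming a SparkSession named spark and that my_table is registered as a table or temp view (column and view names follow the question):

from pyspark.sql import functions as F

# First level: one row per uid with a 0/1 flag for each of the three days.
flags = (
    spark.table("my_table")
    .where(F.col("uid").isNotNull())
    .groupBy("uid")
    .agg(
        F.max(F.when(F.col("day") == 5, 1).otherwise(0)).alias("d0"),
        F.max(F.when(F.col("day") == 6, 1).otherwise(0)).alias("d1"),
        F.max(F.when(F.col("day") == 7, 1).otherwise(0)).alias("d2"),
    )
)

# Second level: count users per (d0, d1, d2) combination.
result = (
    flags.groupBy("d0", "d1", "d2")
    .agg(F.count("*").alias("user_count"))
    .orderBy("d0", "d1", "d2")
)
result.show()

For the sample data in PS2 this yields all 7 non-empty combinations, including 0 1 1 and 1 0 1.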
Regarding the original query: the last FULL JOIN has to account for the fact that t0.uid (and likewise t1.uid) can be NULL as a result of the first FULL JOIN, so the condition must use OR, not AND. With AND, a user such as uid = 3, who appears in t0 and t2 but not in t1, can never match: after the first join its t1.uid is NULL, so t1.uid = t2.uid is not true and the whole predicate fails. That user then gets counted once under 1 0 0 and once under 0 0 1 instead of once under 1 0 1 (and similarly uid = 4 never reaches 0 1 1).
select d0,d1,d2, count(*) as user_count
from (
select uid, 1 as d0
from my_table
where day=5 and uid is not Null
group by uid
) as t0
full outer join (
select uid, 1 as d1
from my_table
where day=6 and uid is not Null
group by uid
) as t1 on t0.uid = t1.uid
full outer join (
select uid, 1 as d2
from my_table
where day=7 and uid is not Null
group by uid
) as t2 on t0.uid = t2.uid or t1.uid = t2.uid
group by d0,d1,d2
order by d0,d1,d2;
Personally, I would stick with Gordon Linoff's solution.
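To see the difference between the AND and the OR predicate concretely, here is a small self-contained PySpark sketch (the day_flag helper and all variable names are only for illustration) that rebuilds t0, t1, t2 from the sample data in PS2 and runs the last full outer join with both conditions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

rows = [(1, 5), (1, 6), (1, 7), (2, 5), (2, 6), (3, 5), (3, 7),
        (4, 6), (4, 7), (5, 5), (6, 6), (7, 7)]
df = spark.createDataFrame(rows, ["uid", "day"])

def day_flag(day, name):
    # One row per uid seen on `day`, with a literal 1 in column `name`.
    return (df.where(F.col("day") == day)
              .select(F.col("uid").alias("uid_" + name), F.lit(1).alias(name))
              .distinct())

t0, t1, t2 = day_flag(5, "d0"), day_flag(6, "d1"), day_flag(7, "d2")
t01 = t0.join(t1, t0["uid_d0"] == t1["uid_d1"], "full_outer")

# AND: for uid = 3, uid_d1 is NULL after the first join, so the predicate is never true.
and_join = t01.join(
    t2, (t01["uid_d0"] == t2["uid_d2"]) & (t01["uid_d1"] == t2["uid_d2"]), "full_outer")

# OR: a match as long as either side of the first join carries the uid.
or_join = t01.join(
    t2, (t01["uid_d0"] == t2["uid_d2"]) | (t01["uid_d1"] == t2["uid_d2"]), "full_outer")

and_join.groupBy("d0", "d1", "d2").count().orderBy("d0", "d1", "d2").show()  # 5 groups
or_join.groupBy("d0", "d1", "d2").count().orderBy("d0", "d1", "d2").show()   # 7 groups

With AND, uid 3 and uid 4 each show up twice (once from the left side of the join and once as an unmatched t2 row), which is why groups like 1 0 0 and 0 0 1 in the original output absorb those users and are inflated; with OR every user lands in exactly one group (unmatched sides appear as NULL flags in this sketch).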