How to replace EXISTS in Hive with two correlated subqueries
I have a query like this:
SELECT u.id, COUNT(*)
FROM users u, posts p
WHERE u.id = p.owneruserid
AND EXISTS (SELECT COUNT(*) as num
            FROM postlinks pl
            WHERE pl.postid = p.id
            GROUP BY pl.id
            HAVING num > 1) --correlated subquery 1
AND EXISTS (SELECT *
            FROM comments c
            WHERE c.postid = p.id) --correlated subquery 2
GROUP BY u.id;
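I researched this and read that Hive does not support IN or EXISTS statements. I read that a workaround is to use a LEFT JOIN. I have tried that, but I run into trouble with the GROUP BY u.id. I read that it always has to be paired with an aggregate function such as COUNT(), but I am not sure how to rewrite this query so that it works. None of the other examples I have seen online seem to be this complex.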
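As you said, you can convert them to a left join, or possibly an inner join, since both subqueries use EXISTS. Simply turn your subqueries into inline views and join them to the original tables.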
SELECT u.id, COUNT(*)
FROM users u
INNER JOIN posts p ON u.id = p.owneruserid
LEFT OUTER JOIN (SELECT COUNT(*) AS num, pl.postid AS postid
                 FROM postlinks pl
                 GROUP BY pl.postid
                 HAVING num > 1) pl ON pl.postid = p.id -- correlated subquery 1 as a left join
LEFT OUTER JOIN (SELECT postid FROM comments c GROUP BY postid) c ON c.postid = p.id -- correlated subquery 2 as a left join
WHERE (c.postid IS NOT NULL AND pl.postid IS NOT NULL) -- ensures a match exists in both subqueries
GROUP BY u.id
A left join can produce duplicate rows; the GROUP BY in subquery 2 avoids that.
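As an alternative to the left-join version, here is a rough sketch using Hive's LEFT SEMI JOIN, which keeps each left-hand row at most once when a match exists and so sidesteps the duplicate-row concern. It is not part of the original answer and has not been run against a real cluster. It assumes the same table and column names as the question, and it follows the answer in grouping postlinks by postid rather than by id. Note that with LEFT SEMI JOIN the right-hand side may only be referenced in the ON clause, not in SELECT or WHERE.

```sql
-- Sketch only: assumes the users, posts, postlinks and comments tables from the question.
SELECT u.id, COUNT(*)
FROM users u
JOIN posts p ON u.id = p.owneruserid
-- replaces EXISTS #1: keep posts that have more than one row in postlinks
LEFT SEMI JOIN (SELECT postid
                FROM postlinks
                GROUP BY postid
                HAVING COUNT(*) > 1) pl ON pl.postid = p.id
-- replaces EXISTS #2: keep posts that have at least one comment
LEFT SEMI JOIN comments c ON c.postid = p.id
GROUP BY u.id;
```

Because a semi join never multiplies left-hand rows, no GROUP BY is needed on the comments side and no IS NOT NULL filter is needed in the WHERE clause.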