How to replace EXISTS in Hive with two correlated subqueries

I have a query like this:

SELECT u.id, COUNT(*)
FROM users u, posts p
WHERE u.id = p.owneruserid 
  AND EXISTS (SELECT COUNT(*) as num
              FROM postlinks pl
              WHERE pl.postid = p.id
              GROUP BY pl.id
              HAVING num > 1) --correlated subquery 1
  AND EXISTS (SELECT *
              FROM comments c
              WHERE c.postid = p.id) --correlated subquery 2
GROUP BY u.id;

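I did some research and read that IN/EXISTS statements are not supported in Hive. I read that a workaround is to use a LEFT JOIN. I tried that, but I ran into trouble with the GROUP BY u.id. I read that it always has to be paired with an aggregate function such as COUNT(), but I'm not sure how to rewrite this query so that it works. None of the other examples I have found online seem to be this complex.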

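As you said, you can convert them to left joins, or possibly even inner joins, since both subqueries use EXISTS. Just turn your subqueries into inline views and join them to the original tables.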

SELECT u.id, COUNT(*)
FROM users u
INNER JOIN posts p ON u.id = p.owneruserid
LEFT OUTER JOIN (SELECT COUNT(*) AS num, pl.postid AS postid
                 FROM postlinks pl
                 GROUP BY pl.postid
                 HAVING num > 1) pl ON pl.postid = p.id -- correlated subquery 1 as a left join
LEFT OUTER JOIN (SELECT postid FROM comments GROUP BY postid) c ON c.postid = p.id -- correlated subquery 2 as a left join
WHERE c.postid IS NOT NULL AND pl.postid IS NOT NULL -- keeps only posts that have a match in both subqueries
GROUP BY u.id
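
Since the WHERE clause throws away every row in which either left join found no match, the same result can probably be had with plain inner joins. Here is a sketch of that variant, assuming the same tables and columns as in the query above (the aliases pl and c are simply reused from it); I have not run it against your data:

```sql
-- Sketch: inner joins replace the LEFT JOIN + IS NOT NULL filter.
SELECT u.id, COUNT(*)
FROM users u
INNER JOIN posts p ON u.id = p.owneruserid
-- posts that have more than one row in postlinks
INNER JOIN (SELECT pl.postid AS postid
            FROM postlinks pl
            GROUP BY pl.postid
            HAVING COUNT(*) > 1) pl ON pl.postid = p.id
-- posts that have at least one comment, reduced to one row per post
INNER JOIN (SELECT postid
            FROM comments
            GROUP BY postid) c ON c.postid = p.id
GROUP BY u.id;
```

Both inline views return at most one row per post, so COUNT(*) still counts qualifying posts per user.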

A left join can produce duplicate rows; using GROUP BY in subquery 2 avoids that.
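
Another option that may be worth trying is Hive's LEFT SEMI JOIN, which keeps each left-hand row at most once when a match exists, so comments should not need a GROUP BY at all. This is only a sketch under the same table and column assumptions as your query, and I have not tested it on your Hive version. Keep in mind that Hive only allows the right-hand side of a semi join to be referenced in the ON clause, not in SELECT or WHERE:

```sql
-- Sketch: each EXISTS becomes a LEFT SEMI JOIN on posts.
SELECT u.id, COUNT(*)
FROM users u
INNER JOIN posts p ON u.id = p.owneruserid
-- EXISTS 1: the post has more than one row in postlinks
LEFT SEMI JOIN (SELECT pl.postid AS postid
                FROM postlinks pl
                GROUP BY pl.postid
                HAVING COUNT(*) > 1) pl ON pl.postid = p.id
-- EXISTS 2: the post has at least one comment
LEFT SEMI JOIN comments c ON c.postid = p.id
GROUP BY u.id;
```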