How to Get Rid of Duplicate Counts in Hive/Impala

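I'm trying to compute the total count for a particular column across three tables in Impala/Hive, but I only seem to get the count for each table separately. For example, I get one count for Poland per table instead of a single count for Poland across all three tables. I tried combining the tables with UNION, but that didn't work. The query I used is below.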

SELECT table1.country, COUNT(*)  
FROM table1 
GROUP BY table1.country  
UNION 
SELECT table2.country, COUNT(*) 
FROM table2 
GROUP BY table2.country 
UNION 
SELECT table3.country, COUNT(*)  
FROM table3
GROUP BY table3.country
ORDER BY COUNT(country) DESC;

Use UNION ALL instead of UNION:

SELECT table1.country, COUNT(*) AS cnt
FROM table1
GROUP BY table1.country
UNION ALL
SELECT table2.country, COUNT(*)
FROM table2
GROUP BY table2.country
UNION ALL
SELECT table3.country, COUNT(*)
FROM table3
GROUP BY table3.country
ORDER BY cnt DESC;

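UNION removes duplicate rows, so if two tables produce the same count for a country, the two identical (country, count) rows collapse into one and that count is lost.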
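To see the difference concretely, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive/Impala. That substitution is an assumption made only so the example runs anywhere, and the table contents are invented. Both tables hold two rows for Poland, so each per-table query returns the identical row ('Poland', 2).

```python
import sqlite3

# In-memory SQLite stands in for Hive/Impala here; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (country TEXT);
    CREATE TABLE table2 (country TEXT);
    INSERT INTO table1 VALUES ('Poland'), ('Poland');
    INSERT INTO table2 VALUES ('Poland'), ('Poland');
""")

# Same per-table counts, combined with a configurable set operator.
per_table = """
    SELECT country, COUNT(*) AS cnt FROM table1 GROUP BY country
    {op}
    SELECT country, COUNT(*) AS cnt FROM table2 GROUP BY country
"""

# UNION collapses the two identical ('Poland', 2) rows into one.
union_rows = conn.execute(per_table.format(op="UNION")).fetchall()
# UNION ALL keeps both rows.
union_all_rows = conn.execute(per_table.format(op="UNION ALL")).fetchall()

print(union_rows)      # [('Poland', 2)]
print(union_all_rows)  # [('Poland', 2), ('Poland', 2)]
```

With UNION, one of the two Poland rows silently disappears; with UNION ALL, both survive and can be summed later.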

Edit:

If you want one row per country, use a subquery and re-aggregate:

SELECT country, SUM(cnt)
FROM (SELECT table1.country, COUNT(*) as cnt
      FROM table1 
      GROUP BY table1.country  
      UNION ALL
      SELECT table2.country, COUNT(*) 
      FROM table2 
      GROUP BY table2.country 
      UNION ALL
      SELECT table3.country, COUNT(*)  
      FROM table3
      GROUP BY table3.country
     ) t
GROUP BY country;
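As a quick check of the re-aggregation, the sketch below runs the same query shape against three tiny tables. It again assumes Python's sqlite3 as a stand-in for Hive/Impala and uses invented data; the `total` alias and the final ORDER BY are additions for readability.

```python
import sqlite3

# In-memory SQLite stands in for Hive/Impala here; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (country TEXT);
    CREATE TABLE table2 (country TEXT);
    CREATE TABLE table3 (country TEXT);
    INSERT INTO table1 VALUES ('Poland'), ('Poland'), ('Spain');
    INSERT INTO table2 VALUES ('Poland'), ('Poland');
    INSERT INTO table3 VALUES ('Poland'), ('Spain');
""")

# Count per table, stack the results with UNION ALL, then sum per country.
totals = conn.execute("""
    SELECT country, SUM(cnt) AS total
    FROM (SELECT country, COUNT(*) AS cnt FROM table1 GROUP BY country
          UNION ALL
          SELECT country, COUNT(*) FROM table2 GROUP BY country
          UNION ALL
          SELECT country, COUNT(*) FROM table3 GROUP BY country
         ) t
    GROUP BY country
    ORDER BY total DESC
""").fetchall()

print(totals)  # [('Poland', 5), ('Spain', 2)]
```

Poland appears 2 + 2 + 1 = 5 times across the three tables and Spain 1 + 1 = 2 times, which matches the output. If the inner query used plain UNION, the identical ('Poland', 2) rows from table1 and table2 would merge, and so would the ('Spain', 1) rows from table1 and table3, understating both totals.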