Combining two queries where one uses GROUP BY

I have two tables. TABLE1 has the columns:

pers_key
cost
visit

TABLE2 has the columns:

pers_key
months

First, I create a temporary table:

CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;

Then I create TABLE3:

CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key

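I would like to know whether there is a better way to get the same result. Is it possible to do this in a single query, without creating temp_table at all? Perhaps something like this: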

CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key

Or is the temporary table required to get the desired result set?

How about just using a subquery?

SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;
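
If the goal is still to materialize TABLE3, the same subquery can be wrapped in CREATE TABLE ... AS, so temp_table is never created. This is a minimal sketch that assumes your engine supports CREATE TABLE ... AS SELECT with a derived table in the FROM clause. The comments about division are general cautions, not something stated in the question.

```sql
-- Sketch: build TABLE3 in one statement, with no intermediate temp_table.
-- Note (assumption about your engine): in some databases, PostgreSQL for
-- example, integer / integer truncates, so cast one operand if cost,
-- visit_count, or months are integers and you want fractional results.
-- Rows with months = 0 may raise an error or yield NULL, depending on the engine.
CREATE TABLE TABLE3 AS
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A
JOIN (SELECT pers_key,
             SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
  ON A.pers_key = B.pers_key;
```

Because the aggregation happens inside the derived table before the join, each pers_key contributes exactly one row from B. The two-step version with temp_table behaves the same way.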

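EDIT:

Your question is a little more complicated than that. The two-step version is definitely a reasonable approach. It might even be faster to put the subquery in a table and build an index on that table for the join. The red flag, however, is the count(distinct). In my experience with Hive, the following is faster than the subquery above: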

     (SELECT pers_key, SUM(sum_cost) AS sum_cost,
             COUNT(visit) AS visit_count
      FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit
           ) t
      GROUP BY pers_key
     ) B

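That this version is faster is (to me) a bit counterintuitive. What happens is that Hive readily parallelizes the GROUP BYs, whereas the count(distinct) is processed serially. This sometimes occurs in other databases too (I have seen similar behavior in Postgres for count(distinct)). One more caveat: I did not set up the Hive system where I discovered this, so it may be some sort of configuration issue.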
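
For completeness, here is how the two-level aggregation above might look when dropped into the full join. It is a sketch assembled from the two snippets in this answer, not a separately benchmarked query. Whether it actually beats COUNT(DISTINCT) will depend on the engine, its version, and its configuration.

```sql
-- Inner query: one row per (pers_key, visit), which removes duplicate visits.
-- Outer query: COUNT(visit) over those rows equals COUNT(DISTINCT visit),
-- because both forms ignore NULL visits.
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A
JOIN (SELECT pers_key,
             SUM(sum_cost) AS sum_cost,
             COUNT(visit) AS visit_count
      FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit
           ) t
      GROUP BY pers_key
     ) B
  ON A.pers_key = B.pers_key;
```

The sum is unaffected by the extra level: summing the per-visit subtotals for a pers_key gives the same total as summing cost directly.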