Combining two queries where one uses GROUP BY
I have two tables. TABLE1 has the columns:
pers_key
cost
visit
TABLE2 has the columns:
pers_key
months
First, I create a temporary table:
CREATE TABLE temp_table AS
SELECT pers_key,
       SUM(cost) AS sum_cost,
       COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;
Then I create TABLE3:
CREATE TABLE TABLE3 AS
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key;
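I would like to know whether there is a better way to get the same result. Can this be done in a single query, without creating temp_table at all? Perhaps something like this: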
CREATE TABLE TABLE3 AS
SELECT A.pers_key,
       (SUM(B.cost) OVER (PARTITION BY B.pers_key)) / A.months AS ind1,
       (COUNT(B.visit) OVER (PARTITION BY B.pers_key)) / A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key;
Or is the temporary table required to get the desired result set?
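For what it is worth, the window-function attempt above is not equivalent to the temp_table version: window functions do not collapse rows, so the join returns one row per TABLE1 row, and COUNT(B.visit) counts every visit row rather than distinct visits. A minimal sketch of this, assuming SQLite (via Python's sqlite3, with window-function support) as a stand-in engine and a made-up sample data set; months is stored as REAL only to sidestep SQLite's integer division:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    CREATE TABLE TABLE2 (pers_key INTEGER, months REAL);
    -- Hypothetical sample data: person 1 has 3 rows but only 2 distinct visits.
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 5.0, 200);
    INSERT INTO TABLE2 VALUES (1, 4.0), (2, 2.0);
""")

# The window-function query from the question (without CREATE TABLE).
window_rows = conn.execute("""
    SELECT A.pers_key,
           (SUM(B.cost) OVER (PARTITION BY B.pers_key)) / A.months AS ind1,
           (COUNT(B.visit) OVER (PARTITION BY B.pers_key)) / A.months AS ind2
    FROM TABLE2 AS A, TABLE1 AS B
    WHERE A.pers_key = B.pers_key
    ORDER BY A.pers_key
""").fetchall()

for row in window_rows:
    print(row)
```

With this data, person 1 appears three times and ind2 is 3/4 rather than the 2/4 that COUNT(DISTINCT visit) would give, so a GROUP BY (or a DISTINCT plus a distinct count) is still needed somewhere.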
How about just using a subquery?
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;
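To check that the subquery form returns the same rows as the two-step temp_table approach, here is a small sketch. It assumes SQLite (through Python's sqlite3) as a stand-in engine and uses invented sample data; months is declared REAL only so that SQLite does not perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    CREATE TABLE TABLE2 (pers_key INTEGER, months REAL);
    -- Hypothetical sample data, not from the original question.
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 5.0, 200);
    INSERT INTO TABLE2 VALUES (1, 4.0), (2, 2.0);

    -- Step 1 of the original approach: materialize the aggregates.
    CREATE TABLE temp_table AS
    SELECT pers_key, SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
    FROM TABLE1
    GROUP BY pers_key;
""")

# Step 2 of the original approach: join against the materialized table.
two_step = conn.execute("""
    SELECT A.pers_key,
           B.sum_cost / A.months AS ind1,
           B.visit_count / A.months AS ind2
    FROM TABLE2 AS A, temp_table AS B
    WHERE A.pers_key = B.pers_key
    ORDER BY A.pers_key
""").fetchall()

# Single query: the same aggregation inlined as a subquery.
single_query = conn.execute("""
    SELECT A.pers_key,
           B.sum_cost / A.months AS ind1,
           B.visit_count / A.months AS ind2
    FROM TABLE2 A JOIN
         (SELECT pers_key, SUM(cost) AS sum_cost,
                 COUNT(DISTINCT visit) AS visit_count
          FROM TABLE1
          GROUP BY pers_key
         ) B
         ON A.pers_key = B.pers_key
    ORDER BY A.pers_key
""").fetchall()

print(two_step == single_query)
print(single_query)
```

The ORDER BY is added only to make the two result lists comparable; it is not part of the original queries.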
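EDIT:
Your question is a little complicated. The subquery is definitely a reasonable approach. Putting the subquery's result in a table and building an index on that table for the join might be faster. However, the red flag is the count(distinct). In my experience with Hive, the following version of the subquery is faster than the one above: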
(SELECT pers_key, SUM(sum_cost) AS sum_cost,
        COUNT(visit) AS visit_count
 FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
       FROM TABLE1
       GROUP BY pers_key, visit
      ) t
 GROUP BY pers_key
) B
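It is a bit counterintuitive (to me) that this version is faster. What happens is that Hive readily parallelizes the group bys, whereas the count(distinct) is processed serially. This sometimes happens in other databases too (I have seen similar behavior with count(distinct) in Postgres). Another caveat: I did not set up the Hive system where I found this, so it could be some sort of configuration issue.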
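The two-level GROUP BY rewrite can be checked for equivalence with COUNT(DISTINCT) on a small data set. The sketch below assumes SQLite (through Python's sqlite3) as a stand-in engine and uses invented rows, including one NULL visit; it only shows that both forms return the same aggregates and says nothing about their relative speed on Hive or Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    -- Hypothetical sample data; the NULL visit shows that both forms skip NULLs
    -- when counting but still include that row's cost in the sum.
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (1, 7.0, NULL), (2, 5.0, 200);
""")

# Original aggregation using COUNT(DISTINCT).
with_distinct = conn.execute("""
    SELECT pers_key, SUM(cost) AS sum_cost,
           COUNT(DISTINCT visit) AS visit_count
    FROM TABLE1
    GROUP BY pers_key
    ORDER BY pers_key
""").fetchall()

# Rewrite: first collapse to one row per (pers_key, visit), then count those rows.
two_level = conn.execute("""
    SELECT pers_key, SUM(sum_cost) AS sum_cost,
           COUNT(visit) AS visit_count
    FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
          FROM TABLE1
          GROUP BY pers_key, visit
         ) t
    GROUP BY pers_key
    ORDER BY pers_key
""").fetchall()

print(with_distinct == two_level)
print(two_level)
```

The inner query makes each (pers_key, visit) pair unique, so the outer COUNT(visit) counts distinct visits without needing DISTINCT.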