How can I get the count and newest record for each pair of columns?
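I have an Athena table with 4 columns (A, B, C, D).
I want to find:
- the number of rows associated with each unique combination of A and B
- the value of C from the most recent row for that same pair of A and B, where D is a timestamp
For example, if this is the input data: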
+---+---+-----+------------+
| A | B |  C  |     D      |
+---+---+-----+------------+
| 1 | 1 | 'a' | 2019-04-04 |
| 1 | 1 | 'b' | 2019-04-03 |
| 1 | 2 | 'c' | 2019-04-02 |
| 1 | 3 | 'd' | 2019-04-01 |
| 2 | 2 | 'e' | 2019-04-03 |
| 2 | 2 | 'f' | 2019-04-04 |
+---+---+-----+------------+
this is the desired output:
+---+---+----------+-------+
| A | B | newest_C | count |
+---+---+----------+-------+
| 1 | 1 |   'a'    |   2   |
| 1 | 2 |   'c'    |   1   |
| 1 | 3 |   'd'    |   1   |
| 2 | 2 |   'f'    |   2   |
+---+---+----------+-------+
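I'm not very good with queries. My best attempt is below:
join two subqueries, one that does the counting and another that ranks each row by time. Then, in the join, select only the top-ranked rows.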
WITH t1 AS (
SELECT A, B, count(*)
FROM data
GROUP BY A, B
),
t2 AS (
SELECT A, B, C, RANK() OVER (PARTITION BY A, B ORDER BY D DESC) AS rank
FROM data
)
SELECT t1.A, t1.B, t2.newest_C, t1.count
FROM t1 LEFT JOIN t2 ON t1.A = t2.A AND t1.B = t2.B
WHERE rank = 1
This can be done with Presto window functions:
SELECT a, b, c AS newest_c, cnt
FROM (
SELECT
t.*,
COUNT(*) OVER(PARTITION BY a, b) AS cnt,
ROW_NUMBER() OVER(PARTITION BY a, b ORDER BY d DESC) AS rn
FROM mytable t
) x WHERE rn = 1
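In the subquery, window functions count the records that share the same (a, b) tuple and number the records in each group by d in descending order. The outer query then keeps only the newest record in each group.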
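To check the window-function query against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. SQLite stands in for Athena/Presto purely as an assumption for local testing (it needs SQLite 3.25 or later for window functions), the dates are stored as ISO text, and the ORDER BY at the end is added only to make the output order deterministic.

```python
import sqlite3

# Sample rows from the question: (a, b, c, d)
ROWS = [
    (1, 1, "a", "2019-04-04"),
    (1, 1, "b", "2019-04-03"),
    (1, 2, "c", "2019-04-02"),
    (1, 3, "d", "2019-04-01"),
    (2, 2, "e", "2019-04-03"),
    (2, 2, "f", "2019-04-04"),
]

# Same query as in the answer, plus ORDER BY for a stable result order.
QUERY = """
SELECT a, b, c AS newest_c, cnt
FROM (
    SELECT
        t.*,
        COUNT(*) OVER (PARTITION BY a, b) AS cnt,
        ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY d DESC) AS rn
    FROM mytable t
) x
WHERE rn = 1
ORDER BY a, b
"""


def newest_per_pair(rows):
    """Load rows into a throwaway table and run the window-function query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mytable (a INTEGER, b INTEGER, c TEXT, d TEXT)")
        conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = newest_per_pair(ROWS)
for row in result:
    print(row)
```

Each (a, b) pair comes back exactly once, because ROW_NUMBER() assigns a distinct number to every row in a partition even when timestamps tie.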
Presto has some sophisticated aggregation functions. So:
select a, b, count(*) as cnt,
max_by(c, d)
from t
group by a, b;
max_by() is explained in the documentation.
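For readers unfamiliar with max_by(c, d), the idea is "the value of c on the row where d is largest". The plain-Python sketch below mimics that aggregation together with count(*) on the question's sample data. The function name count_and_max_by is hypothetical, and the tie rule (keep the first row seen) and the lack of NULL handling are assumptions of this sketch, not a statement about how Presto behaves.

```python
# Sample rows from the question: (a, b, c, d)
ROWS = [
    (1, 1, "a", "2019-04-04"),
    (1, 1, "b", "2019-04-03"),
    (1, 2, "c", "2019-04-02"),
    (1, 3, "d", "2019-04-01"),
    (2, 2, "e", "2019-04-03"),
    (2, 2, "f", "2019-04-04"),
]


def count_and_max_by(rows):
    """Return {(a, b): (newest_c, count)} in a single pass over the rows."""
    groups = {}
    for a, b, c, d in rows:
        count, best_c, best_d = groups.get((a, b), (0, None, None))
        # Keep c from the row with the largest d seen so far.
        # Assumption: on a tie, the first row seen wins.
        if best_d is None or d > best_d:
            best_c, best_d = c, d
        groups[(a, b)] = (count + 1, best_c, best_d)
    return {key: (best_c, count) for key, (count, best_c, _) in groups.items()}


result = count_and_max_by(ROWS)
for (a, b), (newest_c, count) in sorted(result.items()):
    print(a, b, newest_c, count)
```

This is why the max_by version needs no subquery or join: both values are computed in the same GROUP BY pass.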
Gordon Linoff's solution is fine. Here is an alternative if you don't want to use max_by:
SELECT t1.a, t1.b, t1.c, t2.count
FROM data AS t1
INNER JOIN
(SELECT a, b, count(*) AS count, max(d) AS d
FROM data
GROUP BY a,b) AS t2
ON t1.a = t2.a AND t1.b = t2.b AND t1.d = t2.d
Here is a demo!
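One thing to keep in mind with the join on max(d): if two rows of the same (a, b) pair share the newest timestamp, both of them match the join condition, so that pair appears twice in the result. The sketch below shows this with hypothetical data (not the question's sample) in an in-memory SQLite database, used here only as a local stand-in for Athena. The count alias is renamed to cnt and an ORDER BY is added for a stable output order.

```python
import sqlite3

# Hypothetical data: both rows of pair (1, 1) have the same newest timestamp.
TIED_ROWS = [
    (1, 1, "a", "2019-04-04"),
    (1, 1, "b", "2019-04-04"),
    (1, 2, "c", "2019-04-02"),
]

# The join-based query from the answer (alias renamed to cnt, ORDER BY added).
JOIN_QUERY = """
SELECT t1.a, t1.b, t1.c, t2.cnt
FROM data AS t1
INNER JOIN
    (SELECT a, b, count(*) AS cnt, max(d) AS d
     FROM data
     GROUP BY a, b) AS t2
ON t1.a = t2.a AND t1.b = t2.b AND t1.d = t2.d
ORDER BY t1.a, t1.b, t1.c
"""


def run_join_query(rows):
    """Load rows into a throwaway table and run the join-based query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE data (a INTEGER, b INTEGER, c TEXT, d TEXT)")
        conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
        return conn.execute(JOIN_QUERY).fetchall()
    finally:
        conn.close()


tied_result = run_join_query(TIED_ROWS)
for row in tied_result:
    print(row)
```

If timestamps are unique within each pair, as in the question's sample data, this does not come up and the join returns one row per pair.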