How to identify which value appears most often in a SQL table (Snowflake) and account for ties?
Suppose we have the following data:
ID   tag  data     timestamp
001  A    walter   2021-06-04 09:46:25
005  F    junior   2021-06-05 09:47:25
001  B    junior   2021-06-04 09:47:25
002  C    soprano  2021-06-04 09:48:25
002  C    alto     2021-06-04 09:49:25
001  A    brown    2021-06-04 09:50:25
003  A    cleave   2021-06-04 09:51:25
003  B    land     2021-06-04 09:52:25
004  C    before   2021-06-04 09:53:25
005  H    junior   2021-06-04 09:47:25
I need to know which tag appears most often for each ID. In the case of a tie, use the most recent tag for that ID, as indicated by the timestamp.
Expected result:
ID   tag
001  A
002  C
003  B
004  C
005  F
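Group by ID and tag, then use QUALIFY with RANK to keep only the top-ranked tag per ID, ordering by count first and by the latest timestamp second: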
SELECT ID, tag, COUNT(*) AS cnt, MAX(timestamp) AS max_t
FROM tab
GROUP BY ID, tag
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY cnt DESC, max_t DESC) = 1
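To make it clearer what QUALIFY is doing in the query above, the same filter can be written with a derived table and an ordinary WHERE clause. This is only a sketch against the same tab table; the alias ranked and the column rnk are illustrative names that do not appear in the original answer.

```sql
-- Same logic as the QUALIFY query, written with a derived table.
-- `ranked` and `rnk` are illustrative names, not from the original answer.
SELECT ID, tag
FROM (
    SELECT ID,
           tag,
           RANK() OVER (
               PARTITION BY ID
               ORDER BY COUNT(*) DESC,        -- most frequent tag first
                        MAX(timestamp) DESC   -- on a tie, most recent occurrence first
           ) AS rnk
    FROM tab
    GROUP BY ID, tag
) AS ranked
WHERE rnk = 1
ORDER BY ID;
```

QUALIFY filters on the window function result after grouping, which is why the QUALIFY form does not need the extra nesting level.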
Sample data:
CREATE OR REPLACE TABLE tab(ID STRING, tag STRING, data STRING, timestamp TIMESTAMP)
AS
SELECT '001', 'A', 'walter', '2021-06-04 09:46:25'
UNION ALL SELECT '005', 'F', 'junior', '2021-06-05 09:47:25'
UNION ALL SELECT '001', 'B', 'junior', '2021-06-04 09:47:25'
UNION ALL SELECT '002', 'C', 'soprano', '2021-06-04 09:48:25'
UNION ALL SELECT '002', 'C', 'alto', '2021-06-04 09:49:25'
UNION ALL SELECT '001', 'A', 'brown', '2021-06-04 09:50:25'
UNION ALL SELECT '003', 'A', 'cleave', '2021-06-04 09:51:25'
UNION ALL SELECT '003', 'B', 'land', '2021-06-04 09:52:25'
UNION ALL SELECT '004', 'C', 'before', '2021-06-04 09:53:25'
UNION ALL SELECT '005', 'H', 'junior', '2021-06-04 09:47:25';
The query can be simplified by ordering directly on the aggregates and selecting only ID and tag:
SELECT ID, tag
FROM tab
GROUP BY ID, tag
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY COUNT(*) DESC, MAX(timestamp) DESC) = 1
ORDER BY ID;
Output:
ID   tag
001  A
002  C
003  B
004  C
005  F
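One caveat: RANK() returns every row tied for first place, so if two tags for the same ID have both the same count and the same latest timestamp, both rows come back. If exactly one row per ID is required, a ROW_NUMBER variant with an extra tie-breaker may help. The sketch below assumes that breaking such a full tie alphabetically by tag is acceptable, which the question does not specify.

```sql
-- Variant that returns exactly one row per ID.
-- The final ORDER BY key (tag) is an assumed tie-breaker, not part of the question.
SELECT ID, tag
FROM tab
GROUP BY ID, tag
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY ID
    ORDER BY COUNT(*) DESC,        -- most frequent tag first
             MAX(timestamp) DESC,  -- then the most recently seen tag
             tag                   -- assumed last resort: alphabetical order
) = 1
ORDER BY ID;
```

The sample data contains no such full ties, so this variant returns the same five rows as the RANK version.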