How to combine data from different variables together in SQL?
Suppose I have data like this:
USER_ID  TIMESTAMP                data   data2
0001     2021-05-09 12:13:03.445         44
0001     2021-05-09 13:13:03.445  rob
0001     2021-05-09 11:13:03.445
0002     2021-05-09 09:13:03.445  perry  333
0002     2021-05-09 12:13:03.445  carl   333
0003     2021-05-09 16:13:03.445  mitch  1
0003     2021-05-09 17:13:03.445
0002     2021-05-09 16:13:03.445  mitch  5
What I would like to do is collect the latest non-null value from each column and condense them into one table, with one entry per row.
Final result:
USER_ID  data   data2
0001     rob    44
0003     mitch  1
0002     mitch  5
This is what I have so far, but it is incomplete:
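WITH form AS (
    -- rank each user's rows from newest to oldest
    select b.*,
           rank() over (
               partition by user_id
               order by timestamp DESC
           ) as num
    FROM b
)
-- keeps only the newest row per user, even where its columns are NULL
SELECT *
FROM form
WHERE num = 1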
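Hmm . . . this is where IGNORE NULLS would be really useful, but Postgres does not support it (yet?).
Instead, you can use arrays, ordering by non-NULL values first and then by timestamp: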
select user_id,
(array_agg(data order by (data is not null) desc, timestamp desc))[1],
(array_agg(data2 order by (data2 is not null) desc, timestamp desc))[1]
from t
group by user_id;
Here is a db<>fiddle.
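The ordering trick above pushes NULLs to the back of each array. A possible variant, sketched here on the assumption of PostgreSQL 9.4 or later (which added the aggregate FILTER clause), drops the NULLs before aggregating instead. The CTE name t and the column name ts are stand-ins for the real table and its timestamp column; the rows are the sample data from the question.

```sql
-- Sketch only: t and ts are illustrative names, not the asker's schema.
WITH t (user_id, ts, data, data2) AS (
    VALUES ('0001', timestamp '2021-05-09 12:13:03.445', NULL,    44),
           ('0001', timestamp '2021-05-09 13:13:03.445', 'rob',   NULL),
           ('0001', timestamp '2021-05-09 11:13:03.445', NULL,    NULL),
           ('0002', timestamp '2021-05-09 09:13:03.445', 'perry', 333),
           ('0002', timestamp '2021-05-09 12:13:03.445', 'carl',  333),
           ('0003', timestamp '2021-05-09 16:13:03.445', 'mitch', 1),
           ('0003', timestamp '2021-05-09 17:13:03.445', NULL,    NULL),
           ('0002', timestamp '2021-05-09 16:13:03.445', 'mitch', 5)
)
SELECT user_id,
       -- aggregate only the non-NULL values, newest first, then take element 1
       (array_agg(data  ORDER BY ts DESC) FILTER (WHERE data  IS NOT NULL))[1] AS data,
       (array_agg(data2 ORDER BY ts DESC) FILTER (WHERE data2 IS NOT NULL))[1] AS data2
FROM t
GROUP BY user_id
ORDER BY user_id;
```

On the sample rows this should give rob / 44 for 0001, mitch / 5 for 0002 and mitch / 1 for 0003, matching the result asked for. If a user has no non-NULL value in a column, the filtered aggregate sees no rows and the expression yields NULL.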
You can use the LAST_VALUE or FIRST_VALUE function with IGNORE NULLS. For your data set:
WITH x AS (
SELECT *
FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL),
('0002','2021-05-09 09:13:03.445'::timestamp,'perry',333),
('0002','2021-05-09 12:13:03.445'::timestamp,'carl',333),
('0003','2021-05-09 16:13:03.445'::timestamp,'mitch',1),
('0003','2021-05-09 17:13:03.445'::timestamp,NULL,NULL),
('0002','2021-05-09 16:13:03.445'::timestamp,'mitch',5)
) x (id, ts, data, data2)
)
you would do:
SELECT id,
LAST_VALUE(data) IGNORE NULLS OVER (PARTITION BY ID ORDER BY ts) as data_last,
LAST_VALUE(data2) IGNORE NULLS OVER (PARTITION BY ID ORDER BY ts) as data2_last
FROM x
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) = 1;
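One portability caveat: the query above does not spell out a window frame. As far as I know, Snowflake lets LAST_VALUE see the whole partition by default, whereas the SQL-standard default frame with an ORDER BY ends at the current row, so the same text may return different values on other engines that accept IGNORE NULLS. A hedged sketch with the frame written out explicitly follows; it still assumes Snowflake for QUALIFY and the ::timestamp casts, and the inner alias v is illustrative.

```sql
WITH x AS (
    SELECT *
    FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
                 ('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
                 ('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL),
                 ('0002','2021-05-09 09:13:03.445'::timestamp,'perry',333),
                 ('0002','2021-05-09 12:13:03.445'::timestamp,'carl',333),
                 ('0003','2021-05-09 16:13:03.445'::timestamp,'mitch',1),
                 ('0003','2021-05-09 17:13:03.445'::timestamp,NULL,NULL),
                 ('0002','2021-05-09 16:13:03.445'::timestamp,'mitch',5)
         ) v (id, ts, data, data2)
)
SELECT id,
       -- explicit frame: every row of the partition is visible from every row
       LAST_VALUE(data) IGNORE NULLS OVER (
           PARTITION BY id ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS data_last,
       LAST_VALUE(data2) IGNORE NULLS OVER (
           PARTITION BY id ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS data2_last
FROM x
-- every row of a partition now carries the same two values, so keep one row per id
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) = 1;
```

With the full-partition frame, which row QUALIFY keeps does not matter, because the two computed columns are identical on every row of a given id.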
Related: Equivalent for Keep in Snowflake:
It could be achieved with:
WITH cte(user_id, timestamp, "data", data2) AS (
SELECT *
FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL),
('0002','2021-05-09 09:13:03.445'::timestamp,'perry',333),
('0002','2021-05-09 12:13:03.445'::timestamp,'carl',333),
('0003','2021-05-09 16:13:03.445'::timestamp,'mitch',1),
('0003','2021-05-09 17:13:03.445'::timestamp,NULL,NULL),
('0002','2021-05-09 16:13:03.445'::timestamp,'mitch',5)
)
)
SELECT user_id,
(ARRAY_AGG("data") WITHIN GROUP (ORDER BY timestamp DESC))[0]::STRING AS "data",
(ARRAY_AGG(data2) WITHIN GROUP (ORDER BY timestamp DESC))[0] AS data2
FROM cte
GROUP BY user_id
ORDER BY user_id;
Output:
+---------+----------+-------+
| USER_ID | data     | data2 |
+---------+----------+-------+
| 0001    | rob      | 44    |
| 0002    | mitch    | 5     |
| 0003    | mitch    | 1     |
+---------+----------+-------+
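ARRAY_AGG omits NULLs by default, and the WITHIN GROUP clause sorts the values by timestamp descending. Once the array has been built for each user_id, it is just a matter of accessing the first element (the one at index [0]).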
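To see why indexing at [0] works, it can help to look at the arrays before an element is picked. The sketch below assumes Snowflake and uses only the three sample rows for user 0001; the column name ts and the aliases data_values and data2_values are illustrative, not part of the original query.

```sql
WITH cte (user_id, ts, "data", data2) AS (
    SELECT *
    FROM (VALUES ('0001','2021-05-09 12:13:03.445'::timestamp,NULL,44),
                 ('0001','2021-05-09 13:13:03.445'::timestamp,'rob',NULL),
                 ('0001','2021-05-09 11:13:03.445'::timestamp,NULL,NULL)
         )
)
SELECT user_id,
       -- full arrays, newest first; NULL inputs are left out by ARRAY_AGG
       ARRAY_AGG("data") WITHIN GROUP (ORDER BY ts DESC) AS data_values,
       ARRAY_AGG(data2)  WITHIN GROUP (ORDER BY ts DESC) AS data2_values
FROM cte
GROUP BY user_id;
```

Since the NULLs are skipped, each array for user 0001 should hold a single element, which is why element [0] is the newest non-NULL value for that column.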