Retrieve records against most recent state/attribute value
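Going with a denormalized structure in Redshift, the plan is to keep creating records and, at retrieval time, to consider only the most recent attributes recorded for each user.
The table looks like this: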
user_id  state  created_at
1        A      15-10-2015 02:00:00 AM
2        A      15-10-2015 02:00:01 AM
3        A      15-10-2015 02:00:02 AM
1        B      15-10-2015 02:00:03 AM
4        A      15-10-2015 02:00:04 AM
5        B      15-10-2015 02:00:05 AM
The desired result set is:
user_id  state  created_at
2        A      15-10-2015 02:00:01 AM
3        A      15-10-2015 02:00:02 AM
4        A      15-10-2015 02:00:04 AM
I have this query, which retrieves the result above:
select user_id, first_value AS state
from (
    select user_id, first_value(state) OVER (
               PARTITION BY user_id
               ORDER BY created_at desc
               ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
    from customer_properties
    order by created_at) t
where first_value = 'A'
Is this the best way to retrieve the rows, or can the query be improved?
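The best query depends on various details: selectivity of the query predicate, cardinality, data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster: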
SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL;
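This holds provided there is an index on (state) (or even (state, user_id, created_at)) and another one on (user_id, created_at). Redshift itself has no conventional indexes, so there the sort key would have to serve this purpose.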
There are various ways to make sure a later version of the row does not exist:
- Select rows which are not present in other table
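One such alternative is a NOT EXISTS anti-join. The sketch below runs it against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for Redshift, which is an assumption made purely so the example can be executed locally. The helper name latest_state_a is hypothetical, and the timestamps are rewritten as ISO-8601 strings so that text comparison matches chronological order.

```python
import sqlite3

# Sample rows from the question; timestamps rewritten as ISO-8601 text
# so that string comparison in SQLite matches chronological order.
ROWS = [
    (1, "A", "2015-10-15 02:00:00"),
    (2, "A", "2015-10-15 02:00:01"),
    (3, "A", "2015-10-15 02:00:02"),
    (1, "B", "2015-10-15 02:00:03"),
    (4, "A", "2015-10-15 02:00:04"),
    (5, "B", "2015-10-15 02:00:05"),
]

# Keep a row in state 'A' only if no newer row exists for the same user.
NOT_EXISTS_QUERY = """
SELECT c.user_id, c.state, c.created_at
FROM   customer_properties c
WHERE  c.state = 'A'
AND    NOT EXISTS (
   SELECT 1
   FROM   customer_properties c1
   WHERE  c1.user_id = c.user_id
   AND    c1.created_at > c.created_at
   )
ORDER  BY c.user_id
"""


def latest_state_a(rows):
    """Load rows into an in-memory table and run the anti-join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_properties ("
        "user_id INTEGER NOT NULL, state TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer_properties VALUES (?, ?, ?)", rows)
    result = conn.execute(NOT_EXISTS_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_state_a(ROWS):
        print(row)
```

On the sample data this returns users 2, 3 and 4. User 1 drops out because a newer 'B' row exists. Whether NOT EXISTS or the LEFT JOIN form performs better on Redshift would have to be measured on real data.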
If 'A' is a common value in state, this more generic query will be faster:
SELECT user_id, state
FROM  (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A';
I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I do not think Redshift has it:
- PostgreSQL sort by datetime asc, null first?
Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options:
- Select first row in each GROUP BY group?
- Optimize GROUP BY query to retrieve latest record per user
Your original query would return all rows per user_id if the latest row matches. You would then have to fold the duplicates, which is needless work.
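The duplicate-row behaviour can be checked with a small experiment. The sketch below again uses Python's sqlite3 module as a stand-in for Redshift (an assumption, and it needs SQLite 3.25 or later for window functions). It adds a hypothetical user 6 with two 'A' rows, which is not part of the question's data, and gives the window column an explicit alias latest_state because SQLite does not name it first_value automatically.

```python
import sqlite3

# Question's sample rows plus a HYPOTHETICAL user 6 whose two rows are both 'A'.
ROWS = [
    (1, "A", "2015-10-15 02:00:00"),
    (2, "A", "2015-10-15 02:00:01"),
    (3, "A", "2015-10-15 02:00:02"),
    (1, "B", "2015-10-15 02:00:03"),
    (4, "A", "2015-10-15 02:00:04"),
    (5, "B", "2015-10-15 02:00:05"),
    (6, "A", "2015-10-15 02:00:06"),
    (6, "A", "2015-10-15 02:00:07"),
]

# The question's approach: every row of a user carries that user's latest state.
FIRST_VALUE_QUERY = """
SELECT user_id, latest_state AS state
FROM  (
   SELECT user_id, first_value(state) OVER (
             PARTITION BY user_id
             ORDER BY created_at DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS latest_state
   FROM   customer_properties
   ) t
WHERE  latest_state = 'A'
"""

# The answer's approach: keep only the newest row per user, then filter.
ROW_NUMBER_QUERY = """
SELECT user_id, state
FROM  (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A'
"""


def user_ids(query, rows):
    """Run a query on an in-memory copy of the table; return sorted user_ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_properties ("
        "user_id INTEGER NOT NULL, state TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer_properties VALUES (?, ?, ?)", rows)
    ids = sorted(r[0] for r in conn.execute(query))
    conn.close()
    return ids


if __name__ == "__main__":
    print("first_value:", user_ids(FIRST_VALUE_QUERY, ROWS))
    print("row_number: ", user_ids(ROW_NUMBER_QUERY, ROWS))
```

With the extra user, the first_value query lists user 6 twice while the row_number query lists each qualifying user once.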