Retrieve records against most recent state/attribute value

Question

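In Redshift I am using a denormalized structure. The plan is to keep inserting new records and, at retrieval time, consider only the most recent attribute for each user.

Here is the table: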

user_id   state  created_at
1         A      15-10-2015 02:00:00 AM
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
1         B      15-10-2015 02:00:03 AM
4         A      15-10-2015 02:00:04 AM
5         B      15-10-2015 02:00:05 AM

The desired result set is:

user_id   state  created_at
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
4         A      15-10-2015 02:00:04 AM

I have a query that retrieves the above result:

select user_id, first_value AS state
from (
   select user_id, first_value(state) OVER (
                     PARTITION BY user_id
                     ORDER BY created_at desc
                     ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
   from customer_properties
   order by created_at) t
where first_value = 'A'

Is this the best way to retrieve it, or can the query be improved?

Answer 1

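The best query depends on various details: the selectivity of the query predicate, cardinality, and data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster: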

SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL;

This assumes there is an index on (state) (or even on (state, user_id, created_at)) and another one on (user_id, created_at).
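As a rough sketch, the two indexes mentioned above could be created as follows. This is plain Postgres syntax and the index names are made up for illustration. Redshift does not support CREATE INDEX and relies on sort keys instead, so there the closest equivalent would be a suitable sort key on the table.

```sql
-- Postgres syntax; index names are hypothetical.
-- Supports the filter on state (and can cover the other two columns).
CREATE INDEX customer_properties_state_idx
    ON customer_properties (state, user_id, created_at);

-- Supports the anti-join lookup for a later row of the same user.
CREATE INDEX customer_properties_user_created_idx
    ON customer_properties (user_id, created_at);
```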

There are various ways to make sure that a later version of the row does not exist:

Select rows which are not present in other table
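One of those alternatives, sketched here against the same table, is NOT EXISTS in place of the LEFT JOIN / IS NULL anti-join. It should return the same rows. Whether Redshift accepts and plans this correlated subquery well is an assumption to verify on your cluster.

```sql
-- Keep a row with state 'A' only if the same user has no later row.
SELECT c.user_id, c.state
FROM   customer_properties c
WHERE  c.state = 'A'
AND    NOT EXISTS (
   SELECT 1
   FROM   customer_properties c1
   WHERE  c1.user_id = c.user_id
   AND    c1.created_at > c.created_at
   );
```

With the sample data from the question this keeps users 2, 3 and 4. User 1 is dropped because a later row with state 'B' exists.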

If 'A' is a common value in state, this more generic query will be faster:

SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A';

I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I do not think Redshift has it:

PostgreSQL sort by datetime asc, null first?

Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options:

Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
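As an illustration of what such an option can look like, here is a minimal sketch using DISTINCT ON, a Postgres-specific extension that is not expected to work in Redshift. It picks the latest row per user first and filters on state afterwards.

```sql
-- Postgres only: DISTINCT ON keeps the first row per user_id
-- according to the ORDER BY, i.e. the most recent one.
SELECT user_id, state, created_at
FROM  (
   SELECT DISTINCT ON (user_id)
          user_id, state, created_at
   FROM   customer_properties
   ORDER  BY user_id, created_at DESC
   ) t
WHERE  state = 'A';
```

The filter has to stay in the outer query. Moving state = 'A' into the subquery would return the latest 'A' row per user even when a newer row with a different state exists.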

Your original query would return all rows per user_id if the latest row matches. You would then have to fold the duplicates, which is needless work.
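To make the duplicate problem concrete: first_value() is computed for every row of a partition, so a user with three rows whose latest state is 'A' shows up three times. A sketch of the original query with the duplicates folded by DISTINCT could look like this (the column alias is added for readability). It is shown only to illustrate the extra step, not as a recommendation.

```sql
-- Every row of a user carries that user's latest state,
-- so DISTINCT is needed to collapse them to one row per user.
SELECT DISTINCT user_id, state
FROM  (
   SELECT user_id
        , first_value(state) OVER (
             PARTITION BY user_id
             ORDER BY created_at DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS state
   FROM   customer_properties
   ) t
WHERE  state = 'A';
```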

sql

postgresql

greatest-n-per-group

amazon-redshift