Redshift could avoid full table scan using sortkey and joined table

Question

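I have a very large table "event" in Redshift and a much smaller table "d_date" that represents dates. Redshift performs a full table scan on "event" for the SQL below unless I uncomment the commented-out predicate. Table event has date_id as its sort key.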

Why doesn't Redshift figure out that it would be far cheaper to scan d_date first and then restrict the scan of the event table to the matching values?

select d_date.date_id, count(*)
from d_date
join event on d_date.date_id = event.date_id
where d_date.sqldate > '2016-06-03'
/* without this the query will do a full table scan and run very slow */
/* and d_date.date_id > 20160603 */
group by 1;

Here is the EXPLAIN output:

QUERY PLAN
XN HashAggregate  (cost=19673968.12..19673971.77 rows=1460 width=4)
->  XN Hash Join DS_DIST_ALL_NONE  (cost=78.63..18758349.28 rows=183123769 width=4)
    Hash Cond: ("outer".date_id = "inner".date_id)
    ->  XN Seq Scan on event  (cost=0.00..7523125.76 rows=752312576 width=4)
    ->  XN Hash  (cost=74.98..74.98 rows=1460 width=4)
          ->  XN Seq Scan on d_date  (cost=0.00..74.98 rows=1460 width=4)
                Filter: (sqldate > '2016-06-03'::date)

With the commented-out predicate uncommented, the scan step for the event table looks like this:

    ->  XN Seq Scan on event  (cost=0.00..928.32 rows=74266 width=4)

I have already VACUUMed and ANALYZEd the tables, and the primary and foreign keys are set.

Answer 1

The Amazon Redshift documentation addresses this topic specifically in Amazon Redshift Best Practices for Designing Queries:

If possible, use a WHERE clause based on the primary sort column of the largest table in the query to restrict the dataset. The query planner can then use row order to help determine which records match the criteria, so it can skip scanning large numbers of disk blocks. Without this, the query execution engine must scan the entire table.

Add predicates to filter tables that participate in joins, even if the predicates apply the same filters. The query returns the same result set, but Amazon Redshift is able to filter the join tables before the scan step and can then efficiently skip scanning blocks from those tables.

For example, suppose you want to join SALES and LISTING to find ticket sales for tickets listed after December, grouped by seller. Both tables are sorted by date. The following query joins the tables on their common key and filters for listing.listtime values greater than December 1:

select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
group by 1 order by 1;

The WHERE clause doesn't include a predicate for sales.saletime, so the execution engine is forced to scan the entire SALES table. If you know the filter would result in fewer rows participating in the join, then add that filter as well. The following example cuts execution time significantly:

select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.salesid = listing.listid
and listing.listtime > '2008-12-01'
and sales.saletime > '2008-12-01'
group by 1 order by 1;
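
Applied to the query in the question, the same advice might look like the sketch below. It assumes that date_id is an integer of the form YYYYMMDD that increases together with sqldate, as the commented-out predicate in the question suggests; if date_id were a surrogate key with no such ordering, this range filter would not be equivalent. The filter is written directly against event.date_id, the sort key of the large table, so the intent is explicit.

```sql
-- Sketch only: assumes date_id is a YYYYMMDD integer ordered like sqldate.
select d_date.date_id, count(*)
from d_date
join event on d_date.date_id = event.date_id
where d_date.sqldate > '2016-06-03'
  -- Redundant with the sqldate filter, but expressed on event's sort key
  -- so that the scan of event can be restricted to the matching range.
  and event.date_id > 20160603
group by 1;
```

Prefixing the statement with EXPLAIN and comparing the estimated rows on the "XN Seq Scan on event" line with the plan in the question is a quick way to check whether the extra predicate is actually restricting the scan.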

sql

amazon-redshift