Redshift Query Execution Plan

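I noticed that the query below runs slowly, and after looking at it in detail I am wondering why Redshift first scans both tables (events and contact) separately and only then joins them. The contact table has more than 300,000 rows. My expectation was that Redshift would first scan the large events table using the filters specified for it, and then look up the matching contacts by the Contact_ID column. Is my expectation incorrect? Is there anything else I can do to speed up the query? I have run Vacuum and Analyze on all tables.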

Query:

select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
Events et 
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
where
et.Account_ID = 5
and et.Event_ID in (1, 2)
and et.IsGuest = 0
and et.dim_date_id >=20151125 
and et.dim_date_id <=20160226
group by c.Segment
order by 1

Explain:

XN Merge (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
  -> XN Network (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
    -> XN Sort (cost=1000000074927.82..1000000074927.83 rows=1 width=20)
      -> XN HashAggregate (cost=74927.80..74927.81 rows=1 width=20)
        -> XN Merge Join DS_DIST_NONE (cost=0.00..74927.57 rows=31 width=20)
          -> XN Seq Scan on contact c (cost=0.00..497.56 rows=39805 width=16)
          -> XN Seq Scan on eventtransaction et (cost=0.00..6664.84 rows=136 width=20)

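The filter is applied only after the join is performed. If you want the join to happen after the filter has been applied, I suggest filtering the events first, either in a temp table or in a subquery, and joining that result with the contact table, as in the query below.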

select c.Segment
, Count (Distinct (CASE WHEN et.Event_ID = 1 THEN et.Contact_ID ELSE null END)) as L1
, Count (Distinct (CASE WHEN et.Event_ID = 2 THEN et.Contact_ID ELSE null END)) as L2
from
(
  select Event_ID, Account_ID, Contact_ID
  FROM Events
  WHERE
    Account_ID = 5
    and Event_ID in (1, 2)
    and IsGuest = 0
    and dim_date_id >= 20151125
    and dim_date_id <= 20160226
) et
join contact c on c.Account_ID = et.Account_ID and c.ID = et.Contact_ID
group by c.Segment
order by 1
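
The same idea can also be written with an explicit temp table instead of a subquery. This is only a sketch under assumptions: the table name filtered_events is hypothetical, and the source table is taken to be Events as in the question (the plan refers to it as eventtransaction, so adjust the name to match your schema).

```sql
-- Step 1: materialize only the events that pass the filters.
-- "filtered_events" is a hypothetical name chosen for this sketch.
CREATE TEMP TABLE filtered_events AS
SELECT Event_ID, Account_ID, Contact_ID
FROM Events
WHERE Account_ID = 5
  AND Event_ID IN (1, 2)
  AND IsGuest = 0
  AND dim_date_id BETWEEN 20151125 AND 20160226;

-- Step 2: join the (much smaller) filtered set to contact and aggregate.
SELECT c.Segment
     , COUNT(DISTINCT CASE WHEN fe.Event_ID = 1 THEN fe.Contact_ID END) AS L1
     , COUNT(DISTINCT CASE WHEN fe.Event_ID = 2 THEN fe.Contact_ID END) AS L2
FROM filtered_events fe
JOIN contact c
  ON c.Account_ID = fe.Account_ID
 AND c.ID = fe.Contact_ID
GROUP BY c.Segment
ORDER BY 1;
```

Running EXPLAIN on the second statement shows whether the planner now joins against the reduced row set.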

In addition, if you define a sort key on dim_date_id, you should see a significant improvement in the speed of this query. More details on this can be found here
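
One way to try the sort key suggestion without touching the original table is to build a sorted copy with CREATE TABLE AS and compare plans. This is a rough sketch, not a recommendation for your exact schema: events_by_date is a hypothetical name, the source table is assumed to be Events, and distribution settings are not specified here, so state them explicitly if they matter for your joins.

```sql
-- Build a copy of the events table sorted on dim_date_id.
-- "events_by_date" is a hypothetical name used only for this sketch.
CREATE TABLE events_by_date
SORTKEY (dim_date_id)
AS
SELECT * FROM Events;

-- Refresh planner statistics for the new table.
ANALYZE events_by_date;

-- Inspect the plan for the date-range filter against the sorted copy.
EXPLAIN
SELECT Event_ID, Account_ID, Contact_ID
FROM events_by_date
WHERE Account_ID = 5
  AND Event_ID IN (1, 2)
  AND IsGuest = 0
  AND dim_date_id BETWEEN 20151125 AND 20160226;
```

Note that the plan in the question uses a Merge Join, which Redshift chooses when both tables are sorted and distributed on the join columns. Changing the sort key may change the join strategy, so it is worth comparing actual run times rather than relying on the plan cost alone.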