Redshift EXCEPT much slower than LEFT JOIN
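I am trying to compare a staging table ('new data') with another table ('existing data') to identify added/changed/removed rows, and ultimately apply an update. This is an expensive operation: a full diff over a large dataset. I would really like to use the EXCEPT command for its syntactic clarity, but I am running into serious performance problems with it and finding that a LEFT JOIN does much better.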
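The two tables have a similar number of rows and (almost) the same schema; the 'second' table has one extra created_date column. They share distkey(date) and sortkey(date, id1, id2); I even list the columns in the 'correct' order in the EXCEPT statement to help the optimizer.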
The query plans for each approach, run against a test-sized subset of the data, are below.
explain
select date, id1, id2, id3, value, attr1, attr2, attr3 from new_data
except select date, id1, id2, id3, value, attr1, attr2, attr3 from existing_data;
XN SetOp Except (cost=1000002817944.78..1000003266822.61 rows=1995013 width=1637)
  -> XN Sort (cost=1000002817944.78..1000002867820.09 rows=19950126 width=1637)
       Sort Key: date, id1, id2, id3, value, attr1, attr2, attr3
       -> XN Append (cost=0.00..399002.52 rows=19950126 width=1637)
            -> XN Subquery Scan "*SELECT* 1" (cost=0.00..199501.26 rows=9975063 width=1637)
                 -> XN Seq Scan on new_data (cost=0.00..99750.63 rows=9975063 width=1637)
            -> XN Subquery Scan "*SELECT* 2" (cost=0.00..199501.26 rows=9975063 width=1636)
                 -> XN Seq Scan on existing_data (cost=0.00..99750.63 rows=9975063 width=1636)
Compare that with my much uglier LEFT JOIN:
explain
select t1.* from new_data t1
left outer join existing_data t2 on
t1.date = t2.date
and t1.id1 = t2.id1
and coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
and coalesce(t1.id3, -1) = coalesce(t2.id3, -1)
and coalesce(t1.value, -1) = coalesce(t2.value, -1)
and coalesce(t1.attr1, '') = coalesce(t2.attr1, '')
and coalesce(t1.attr2, '') = coalesce(t2.attr2, '')
and coalesce(t1.attr3, '') = coalesce(t2.attr3, '')
where t2.id1 is null;
XN Merge Left Join DS_DIST_NONE (cost=0.00..68706795.68 rows=9975063 width=1637)
  Merge Cond: (("outer".date = "inner".date) AND (("outer".id1)::bigint = "inner".id1))
  Join Filter: (((COALESCE("outer".id2, -1))::bigint = COALESCE("inner".id2, -1::bigint)) AND ((COALESCE("outer".id3, -1))::bigint = COALESCE("inner".id3, -1::bigint)) AND ((COALESCE("outer".value, -1::numeric))::double precision = COALESCE("inner".value, -1::double precision)) AND ((COALESCE("outer".attr1, ''::character varying))::text = (COALESCE("inner".attr1, ''::character varying))::text) AND ((COALESCE("outer".attr2, ''::character varying))::text = (COALESCE("inner".attr2, ''::character varying))::text) AND ((COALESCE("outer".attr3, ''::character varying))::text = (COALESCE("inner".attr3, ''::character varying))::text))
  Filter: ("inner".id1 IS NULL)
  -> XN Seq Scan on new_data t1 (cost=0.00..99750.63 rows=9975063 width=1637)
  -> XN Seq Scan on existing_data t2 (cost=0.00..99750.63 rows=9975063 width=1636)
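The query costs are 1000003266822.61 versus 68706795.68. I know I am not supposed to compare costs across queries, but the difference is borne out in execution time. Any idea why the EXCEPT statement is so much slower than the LEFT JOIN?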
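One caveat when swapping one form for the other: the two queries are not strictly equivalent. EXCEPT compares whole rows with NULLs treated as equal and returns distinct rows, while the LEFT JOIN form keeps duplicate rows from new_data and, because of the coalesce sentinels, cannot tell a NULL from a real -1 or ''. The sketch below shows the duplicate-row difference on a toy dataset. It uses SQLite through Python's sqlite3 module purely as a stand-in for Redshift, and the trimmed four-column schema and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Redshift (assumption: only the
# set semantics matter here, not the planner or the performance).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE new_data      (date TEXT, id1 INTEGER, id2 INTEGER, value REAL);
    CREATE TABLE existing_data (date TEXT, id1 INTEGER, id2 INTEGER, value REAL);

    INSERT INTO new_data VALUES
        ('2020-01-01', 1, NULL, 1.0),  -- unchanged, id2 is NULL on both sides
        ('2020-01-01', 2, 5,    2.5),  -- changed value
        ('2020-01-01', 3, 7,    3.0),  -- added
        ('2020-01-01', 3, 7,    3.0);  -- added, exact duplicate of the row above

    INSERT INTO existing_data VALUES
        ('2020-01-01', 1, NULL, 1.0),
        ('2020-01-01', 2, 5,    2.0);
""")

# Set difference: NULLs compare as equal, result rows are distinct.
except_rows = conn.execute("""
    SELECT date, id1, id2, value FROM new_data
    EXCEPT
    SELECT date, id1, id2, value FROM existing_data
""").fetchall()

# Anti-join: the same shape as the LEFT JOIN in the question, fewer columns.
anti_join_rows = conn.execute("""
    SELECT t1.date, t1.id1, t1.id2, t1.value
    FROM new_data t1
    LEFT OUTER JOIN existing_data t2
      ON  t1.date = t2.date
      AND t1.id1 = t2.id1
      AND coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
      AND coalesce(t1.value, -1) = coalesce(t2.value, -1)
    WHERE t2.id1 IS NULL
""").fetchall()

print("EXCEPT   :", sorted(except_rows))
print("ANTI JOIN:", sorted(anti_join_rows))
```

Both queries find the same changed and added rows, but the anti-join returns the duplicated row twice. If the staging table can contain exact duplicates, a SELECT DISTINCT would be needed on the LEFT JOIN side to match EXCEPT.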
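The left join produces a bunch of cross-joined rows for each (presumably ordered) key value and then filters out the unwanted ones via the on clause. It can also stop once the (presumably ordered) old key values are past the new key values, since there can be no more matches; this also involves some reasoning via some coalesce SARG smarts. except, by contrast, sorts everything first. In this case the cost of that sort is higher than the cost of producing and throwing away rows, multiplied by traversing the right table's rows for each key. Of course the optimizer could include an outer join idiom in its except planning, but it apparently doesn't.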
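The difference between the two strategies can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Redshift's actual implementation: the function names merge_anti_join and sort_based_except are hypothetical, rows are plain tuples whose first two fields stand in for (date, id1), and the inputs to the merge version are assumed to be already sorted on those fields, as the shared sort key suggests.

```python
from itertools import groupby

def merge_anti_join(new_rows, old_rows, key_len=2):
    """Yield rows of new_rows that have no identical row in old_rows.

    Assumes both inputs are ALREADY sorted on their first key_len fields,
    so each side is read once and nothing is re-sorted.
    """
    old_iter = iter(old_rows)
    old = next(old_iter, None)
    group, group_key = [], None  # old rows that share the current key
    for row in new_rows:
        key = row[:key_len]
        if key != group_key:
            # Skip old rows whose key sorts before the current new key.
            while old is not None and old[:key_len] < key:
                old = next(old_iter, None)
            # Collect the old rows that share the current key.
            group = []
            while old is not None and old[:key_len] == key:
                group.append(old)
                old = next(old_iter, None)
            group_key = key
        # The "join filter": compare the full row only within the key group.
        if row not in group:
            yield row

def sort_based_except(new_rows, old_rows):
    """Yield distinct rows of new_rows absent from old_rows.

    Mirrors the Append -> Sort -> SetOp plan shape: tag each row with its
    source, sort the concatenation on ALL columns, then scan equal-row groups.
    """
    tagged = [(row, 0) for row in new_rows] + [(row, 1) for row in old_rows]
    tagged.sort()  # the expensive step: a full sort of both inputs
    for row, grp in groupby(tagged, key=lambda t: t[0]):
        if all(tag == 0 for _, tag in grp):
            yield row

new = [("2020-01-01", 1, 10, 1.0), ("2020-01-01", 2, 20, 2.5), ("2020-01-02", 3, 30, 3.0)]
old = [("2020-01-01", 1, 10, 1.0), ("2020-01-01", 2, 20, 2.0)]

merged = list(merge_anti_join(new, old))
excepted = list(sort_based_except(new, old))
print(merged)
print(excepted)
```

On this input both functions return the changed and the added row. The merge version never sorts and only compares full rows within a matching (date, id1) group, whereas the sort version has to order the concatenation of both inputs on every column before it can emit anything.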
Related: PostgreSQL: NOT IN versus EXCEPT performance difference