PostgreSQL/GreenPlum partition elimination and left join

Question

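Is it possible to get partition elimination with a left outer join to a partitioned table?

My understanding is that partition elimination only works when the partition key appears in the where clause, so where right_table.date_key = '2016-02-01' would trigger partition elimination. That is incompatible with a left join, though, because it eliminates every row that has no match in right_table.

If I write where (right_table.date_key = '2016-02-02' or right_table.date_key is null), then no partition elimination happens at all.

I was asked to post the full query, so here is a cut-down version (the real thing is huge, with dozens of columns, several more tables, some big case statements and confidential client business logic):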

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from fact1
     join voyage on voyage.voyage_sk = fact1.voyage_sk
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk
where voyage.std = '2016-02-02 14:33:00'

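So fact1 always exists, but fact2 is optional. None of the tables are partitioned.

Now, for partitioning, I added a new column, voyage_sdd, which is the date part of voyage.std. I partition the fact tables and the voyage table on the new date column. The query then becomes: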

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from fact1
     join voyage on voyage.voyage_sk = fact1.voyage_sk
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'
and fact1.voyage_sdd = '2016-02-02'
and fact2.voyage_sdd = '2016-02-02'

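The last line turns fact2 into an inner join. If I leave off the last line, the query still works and returns the correct data, but it is less efficient than the non-partitioned query because it has to scan all partitions. If I leave fact2 unpartitioned, I get a slight performance improvement in my test environment, which only has a small data set. I expect that to improve once we get more disk space and representative data volumes in test.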
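For concreteness, the partitioning on voyage_sdd described above might be declared roughly as follows. This is only a sketch: the column types and partition names are assumptions, and it uses the declarative syntax of PostgreSQL 10 and later. Older Greenplum releases use a different PARTITION BY clause, so treat this as illustrative rather than as the asker's actual DDL.

```sql
-- Sketch only: column types and partition names are assumptions,
-- not taken from the original schema.
CREATE TABLE fact2 (
    person_sk  bigint NOT NULL,
    voyage_sdd date   NOT NULL,   -- date part of voyage.std
    score      numeric
) PARTITION BY RANGE (voyage_sdd);

-- One partition per day; the upper bound of each range is exclusive.
CREATE TABLE fact2_2016_02_01 PARTITION OF fact2
    FOR VALUES FROM ('2016-02-01') TO ('2016-02-02');
CREATE TABLE fact2_2016_02_02 PARTITION OF fact2
    FOR VALUES FROM ('2016-02-02') TO ('2016-02-03');
```

With daily partitions like these, a predicate that pins voyage_sdd to a single date needs to touch only one partition, which is the effect the question is trying to preserve under a left join.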

So, to restate my question: how can I partition fact2 and still have a left join?

Update: this works:

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from voyage 
     join fact1  on fact1.voyage_sk  = voyage.voyage_sk and fact1.voyage_sdd = voyage.voyage_sdd
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk and fact2.voyage_sdd = voyage.voyage_sdd
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'

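The optimizer knows that the fact2 (and fact1) tables are partitioned on the join key, and since the voyage table has a constraint on the join key, the fact table partitions can be eliminated.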
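One way to check whether elimination actually happens is to look at the plan. The sketch below assumes the tables exist as described in the question. The exact plan output differs between PostgreSQL and Greenplum and between optimizer versions, so it is shown without expected output.

```sql
-- Ask the planner for the plan without running the query.
-- If elimination works, only the fact1/fact2 partitions for
-- 2016-02-02 should be scanned, not every partition.
EXPLAIN
select voyage.std
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from voyage
     join fact1  on fact1.voyage_sk  = voyage.voyage_sk and fact1.voyage_sdd = voyage.voyage_sdd
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk and fact2.voyage_sdd = voyage.voyage_sdd
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02';
```

Comparing this plan with the plan for the earlier version of the query (without voyage_sdd in the join conditions) should show whether the rewrite is what enables the elimination.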

Answer 1

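First, in where (right_table.date_key = '2016-02-02' or right_table.date_key is null), the or ... is null condition is probably what prevents partition elimination.

Second, on the question of how to partition fact2: most of the time I partition on a date, because most DW queries have a predicate that narrows down the date, just as you did in the last line with fact2.voyage_sdd = '2016-02-02'.

In addition, if it fits your business logic, I would include all partitioning columns in the join columns. In that case, if the optimizer supports dynamic partition elimination through joins, as GPORCA does (http://pivotal.io/big-data/white-paper/optimizing-queries-over-partitioned-tables-in-mpp-systems), you can benefit from it.

Hope this answers your question.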

Answer 2

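What you are asking for is not possible. The condition where (right_table.date_key = '2016-02-02' or right_table.date_key is null) means, in other words, "the date is '2016-02-02' or no other record exists". So we cannot limit the scan to just that one table (partition).

If what you really want is not left join fact2 on fact2.person_sk = person.person_sk and fact2.voyage_sdd = '2016-02-02', your best bet is to try to get a better plan by writing the query some other way, for example: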

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from fact1
join voyage on voyage.voyage_sk = fact1.voyage_sk
join person on person.person_sk = fact1.person_sk
left join fact2 on fact2.person_sk = person.person_sk
   AND fact2.voyage_sdd = '2016-02-02'
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'
and fact1.voyage_sdd = '2016-02-02'
and (fact2.voyage_sdd = '2016-02-02' OR NOT EXISTS (SELECT * FROM fact2 WHERE fact2.person_sk = person.person_sk))
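
To see why the placement of the date filter matters, here is a small self-contained sketch. The tables l and r are hypothetical toy tables, not the asker's schema. It contrasts the three forms discussed in this thread: the filter in WHERE, the filter in WHERE with OR ... IS NULL, and the filter in the ON clause.

```sql
-- Hypothetical toy tables, for illustration only.
CREATE TEMP TABLE l (id int);
CREATE TEMP TABLE r (id int, date_key date);

INSERT INTO l VALUES (1), (2), (3);
INSERT INTO r VALUES (1, '2016-02-02'),  -- id 1: row on the target date
                     (2, '2016-02-01');  -- id 2: row on another date only
-- id 3 has no row in r at all.

-- (a) Filter in WHERE: acts like an inner join. Returns id 1 only.
SELECT l.id, r.date_key
FROM l LEFT JOIN r ON r.id = l.id
WHERE r.date_key = '2016-02-02'
ORDER BY l.id;

-- (b) WHERE with OR ... IS NULL: returns ids 1 and 3. Id 2 is dropped
--     because it does have a row in r, just for a different date.
SELECT l.id, r.date_key
FROM l LEFT JOIN r ON r.id = l.id
WHERE (r.date_key = '2016-02-02' OR r.date_key IS NULL)
ORDER BY l.id;

-- (c) Filter in ON: returns ids 1, 2 and 3. Ids 2 and 3 come back
--     with a NULL date_key.
SELECT l.id, r.date_key
FROM l LEFT JOIN r ON r.id = l.id AND r.date_key = '2016-02-02'
ORDER BY l.id;
```

Form (b) has to know about rows on every date in order to decide whether a left-hand row has any match at all, which is why it cannot be restricted to a single partition. Form (c) only ever needs rows for the one date.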
