PostgreSQL: optimization of query for select with overlap

I wrote the following query to select rows with overlapping periods (for campaigns with the same business identifier!):

select 
   campaign_instance_1.campaign_id,
   campaign_instance_1.start_time
from campaign_instance as campaign_instance_1 
  inner join campaign_instance as campaign_instance_2
  on campaign_instance_1.campaign_id = campaign_instance_2.campaign_id 
  and (
      (campaign_instance_1.start_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
   or (campaign_instance_1.finish_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
   or (campaign_instance_1.start_time<campaign_instance_2.start_time and campaign_instance_1.finish_time>campaign_instance_2.finish_time)
   or (campaign_instance_1.start_time>campaign_instance_2.start_time and campaign_instance_1.finish_time<campaign_instance_2.finish_time))

The index was created as:

 CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_id_and_finish_time
   ON public.campaign_instance_without_index USING btree
   (campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST)
   TABLESPACE pg_default;

It already runs very slowly on 100,000 rows: 43 seconds!

To optimize it, I tried adding start_time to the index:

  (campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST, start_time DESC NULLS LAST)

But the result is the same.

As far as I can tell from the EXPLAIN ANALYZE output, the start_time column of the index is not used as an index condition.

I tried the query with this index on 10,000 rows and on 100,000 rows, so this behavior does not appear to depend on sample size (at least at this scale).

The source table has the following structure:

campaign_id bigint,
fire_time bigint,
start_time bigint,
finish_time bigint,
recap character varying,
details json

Why is my index not being used, and what are possible ways to improve the query?

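Joining campaign_instance to itself does nothing here except act as an "exists" check, and you probably do not intend to get back duplicates of the matching records. So you can simplify the query with EXISTS or a LATERAL join. Your join condition on the times can also be simplified; you appear to be looking for overlapping periods: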

select campaign_id,start_time
from campaign_instance c1
    where exists( select * from campaign_instance c2
        where c1.campaign_id = c2.campaign_id
  and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));

That overlap test could use < and > instead of <= and >=, but I don't know your exact requirements; BETWEEN implicitly means <= and >=.
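As a sanity check on the simplification above, here is a minimal Python sketch that compares the original four-clause condition with the two-comparison overlap test by brute force over small integer intervals. The function names are hypothetical, and the check assumes every row satisfies start_time <= finish_time; for rows that violate this, the two conditions can disagree.

```python
from itertools import product


def original_overlap(s1, f1, s2, f2):
    """The four OR-ed clauses from the question (BETWEEN is inclusive)."""
    return (
        (s2 <= s1 <= f2)
        or (s2 <= f1 <= f2)
        or (s1 < s2 and f1 > f2)
        or (s1 > s2 and f1 < f2)
    )


def simplified_overlap(s1, f1, s2, f2):
    """The two-comparison overlap test used in the EXISTS query."""
    return s1 <= f2 and f1 >= s2


def predicates_agree(limit=6):
    """Compare both predicates on every pair of intervals with endpoints in range(limit)."""
    for s1, f1, s2, f2 in product(range(limit), repeat=4):
        # Assumption: intervals are well formed (start <= finish).
        if s1 > f1 or s2 > f2:
            continue
        if original_overlap(s1, f1, s2, f2) != simplified_overlap(s1, f1, s2, f2):
            return False
    return True


if __name__ == "__main__":
    print(predicates_agree())
```

Both predicates treat intervals that merely touch at an endpoint as overlapping, which matches the inclusive behavior of BETWEEN in the original query.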

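EDIT: To make sure the match is not the row itself (this table should have a primary key to make things easier, but since it doesn't, I assume there are no duplicates on campaign_id, start_time and finish_time, so that they can serve as a composite key):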

select campaign_id,start_time
from campaign_instance c1
    where exists( select * from campaign_instance c2
        where c1.campaign_id = c2.campaign_id
  and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
  and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));

This takes about 230-250 ms on my system (iMac i5 7500, 3.4 GHz, 64 GB RAM).
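To see what the EXISTS query with the self-exclusion returns, the sketch below runs it on a handful of made-up rows. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for PostgreSQL, so it says nothing about PostgreSQL's plan or timing; the helper name and the sample data are assumptions for illustration.

```python
import sqlite3


def overlapping_rows(rows):
    """Return (campaign_id, start_time) for rows that overlap another row of the same campaign."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not PostgreSQL
    conn.execute(
        "CREATE TABLE campaign_instance ("
        "campaign_id INTEGER, start_time INTEGER, finish_time INTEGER)"
    )
    conn.executemany("INSERT INTO campaign_instance VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT campaign_id, start_time
        FROM campaign_instance c1
        WHERE EXISTS (
            SELECT 1
            FROM campaign_instance c2
            WHERE c1.campaign_id = c2.campaign_id
              -- exclude the row itself (assumes no duplicate periods per campaign)
              AND (c1.start_time != c2.start_time OR c1.finish_time != c2.finish_time)
              AND c1.start_time <= c2.finish_time
              AND c1.finish_time >= c2.start_time
        )
        ORDER BY campaign_id, start_time
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Hypothetical sample: campaign 1 has two overlapping periods and one separate period.
    sample = [(1, 0, 10), (1, 5, 15), (1, 20, 30), (2, 0, 10), (2, 11, 20)]
    print(overlapping_rows(sample))
```

Each overlapping row appears once in the result, whereas the original self-join would return a row once per matching partner, including the match of every row with itself.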