Prevent massive sequential scan with datetime query

Question

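How do I filter a query by date so that it does not trigger a massive sequential scan on a large database?

My survey application collects responses, and the answer to each question in a survey is stored in the table response_answer.

When I query all response_answers for a month, I filter by date; however, postgres runs a sequential scan over all response_answers (there are millions of them), and it is slow.

The query: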

explain analyse 
  select count(*)
    from response_answer
    left join response r on r.id = response_answer.response_id
    where r.date_recorded between '2019-08-01T00:00:00.000Z' and '2019-08-29T23:59:59.999Z';

QUERY PLAN
Aggregate  (cost=517661.09..517661.10 rows=1 width=8) (actual time=139362.882..139362.899 rows=1 loops=1)
  ->  Hash Join  (cost=8063.39..517565.30 rows=38316 width=0) (actual time=126512.031..136806.093 rows=316558 loops=1)
        Hash Cond: (response_answer.response_id = r.id)
        ->  Seq Scan on response_answer  (cost=0.00..480365.73 rows=7667473 width=4) (actual time=1.443..70216.817 rows=7667473 loops=1)
        ->  Hash  (cost=8053.35..8053.35 rows=803 width=4) (actual time=173.467..173.476 rows=7010 loops=1)
              Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 311kB
              ->  Seq Scan on response r  (cost=0.00..8053.35 rows=803 width=4) (actual time=0.489..107.417 rows=7010 loops=1)
                    Filter: ((date_recorded >= '2019-08-01'::date) AND (observed_at <= '2019-08-29'::date))
                    Rows Removed by Filter: 153682
Planning time: 21.310 ms
Execution time: 139373.365 ms

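I have indexes on response_answer(response_id), response_answer(id), and response(id).

As the system grows, this query will become too slow to be usable, because the sequential scan will keep taking longer.

When dealing with this much data, how should I design my queries/tables so that the database does not have to run a sequential scan over every. single. row. before finding all the relevant response_answers?

Surely there is a way for Postgres to consider only the responses within the date range first.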

Answer 1

You need indexes on

response (date_recorded, id)

and

response_answer (response_id)

Then VACUUM the tables so that index-only scans can be used.
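
As a rough sketch, the statements could look like the following. The index names are made up for illustration, and adding ANALYZE to the VACUUM is my own assumption rather than part of the advice above; it only refreshes the planner statistics in the same pass.

```sql
-- Index name is hypothetical. date_recorded comes first so the range filter
-- can use the index; id is included so the join key can be read from the
-- index itself.
CREATE INDEX response_date_recorded_id_idx
    ON response (date_recorded, id);

-- The question states that an index on response_answer (response_id) already
-- exists. If it did not, it could be created like this (name is hypothetical):
-- CREATE INDEX response_answer_response_id_idx
--     ON response_answer (response_id);

-- VACUUM updates the visibility map, which index-only scans rely on.
-- ANALYZE (an addition here) refreshes the planner statistics as well.
VACUUM (ANALYZE) response;
VACUUM (ANALYZE) response_answer;
```

On a busy production table, CREATE INDEX CONCURRENTLY may be preferable because it avoids blocking writes while the index is built.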

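For a query like this, you don't need an outer join. PostgreSQL is smart enough to deduce that from the fact that response.id cannot be NULL (the plan in the question already shows a plain Hash Join rather than a Hash Left Join).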
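
For illustration, the query from the question could be written with an inner join as sketched below. The alias ra and the BUFFERS option are my additions; the date range is copied unchanged from the question. Whether the planner then switches from the sequential scan to index scans depends on the available indexes, the table statistics, and the server settings, so the resulting plan is worth checking rather than assuming.

```sql
-- Same count as in the question, written with an inner join.
-- Starting from response makes the date filter the obvious entry point.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM response r
JOIN response_answer ra ON ra.response_id = r.id
WHERE r.date_recorded BETWEEN '2019-08-01T00:00:00.000Z'
                          AND '2019-08-29T23:59:59.999Z';
```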
