How is the `rows` value in `explain analyze` (PostgreSQL) estimated for a function scan?

I was examining part of the execution plan of a complex query and came up with this:

postgres=# explain analyze                                                                                                                                                
select * from generate_series(
            (CURRENT_DATE)::timestamp without time zone,
            (CURRENT_DATE + '14 days'::interval),
            '1 day'::interval)
;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.024..0.036 rows=15 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 0.064 ms
(3 rows)


AFAIK, PostgreSQL estimates the rows of a given table from its reltuples value, which is understandable.

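Given that the generate_series call above actually generates 15 rows (14 days plus the start date, as the actual rows=15 in the plan shows), where does rows=1000 come from in the case of a function scan?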

According to the documentation:

For those interested in further details, estimation of the size of a table (before any WHERE clauses) is done in src/backend/optimizer/util/plancat.c. The generic logic for clause selectivities is in src/backend/optimizer/path/clausesel.c. The operator-specific selectivity functions are mostly found in src/backend/utils/adt/selfuncs.c.

This appears to be the function that computes the estimate for functions:

/*
  * function_selectivity
  *
  * Returns the selectivity of a specified boolean function clause.
  * This code executes registered procedures stored in the
  * pg_proc relation, by calling the function manager.
  *
  * See clause_selectivity() for the meaning of the additional parameters.
  */
 Selectivity
 function_selectivity(PlannerInfo *root,
                      Oid funcid,
                      List *args,
                      Oid inputcollid,
                      bool is_join,
                      int varRelid,
                      JoinType jointype,
                      SpecialJoinInfo *sjinfo)
 {

It looks like this C function reads data from the pg_proc system catalog, where we have:

postgres=# select proname, prosupport, prorows 
           from pg_proc 
           where proname like '%generate%';
           proname            |          prosupport          | prorows 
------------------------------+------------------------------+---------
 generate_subscripts          | -                            |    1000
 generate_subscripts          | -                            |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series_int4_support | -                            |       0
 generate_series              | generate_series_int8_support |    1000
 generate_series              | generate_series_int8_support |    1000
 generate_series_int8_support | -                            |       0
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
(12 rows)

It looks like the pg_proc.prorows column is where the estimate comes from.

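The generate_series functions that take integer arguments (the int4 and int8 variants in the listing above) have a "support function", which looks at the arguments and tells the planner how many rows to expect. The ones that handle timestamps have no such support function. For those, the planner simply uses the function's declared row estimate (prorows), which defaults to 1000.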
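A minimal sketch of how this difference can be observed, assuming a v12 or later server (the `prosupport` column does not exist before that). The first query lists every generate_series variant with its argument types, and the two EXPLAIN statements compare an integer call with the timestamp call from the question. The integer bounds 1 and 15 are illustrative values, not taken from the question.

```sql
-- List each generate_series variant with its argument types, so the
-- timestamp variants can be told apart from the integer ones.
SELECT p.oid::regprocedure AS signature,
       p.prosupport,
       p.prorows
FROM pg_proc AS p
WHERE p.proname = 'generate_series'
ORDER BY p.oid;

-- Integer form: the support function inspects the constant arguments,
-- so the row estimate should match the real row count.
EXPLAIN SELECT * FROM generate_series(1, 15);

-- Timestamp form from the question: with no support function attached,
-- the planner falls back to the function's prorows value.
EXPLAIN SELECT * FROM generate_series(
    CURRENT_DATE::timestamp without time zone,
    CURRENT_DATE + '14 days'::interval,
    '1 day'::interval);
```

Plain EXPLAIN is enough here, since only the planner's estimate is of interest and the query does not need to be executed.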

If you want, you can change this estimate:

alter function generate_series(timestamp without time zone, timestamp without time zone, interval) 
    rows 14;

But this change will not survive pg_upgrade, nor a dump/reload.
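One way to try the effect of such a change without leaving it in place is to run it inside a transaction and roll it back. This is only a sketch: it assumes a role that is allowed to alter the built-in function (typically a superuser), and it relies on DDL being transactional in PostgreSQL.

```sql
BEGIN;

ALTER FUNCTION generate_series(timestamp without time zone,
                               timestamp without time zone,
                               interval)
    ROWS 14;

-- The Function Scan node should now report the new estimate.
EXPLAIN SELECT * FROM generate_series(
    CURRENT_DATE::timestamp without time zone,
    CURRENT_DATE + '14 days'::interval,
    '1 day'::interval);

-- Confirm the value that was written to the catalog.
SELECT prorows
FROM pg_proc
WHERE oid = 'generate_series(timestamp without time zone, timestamp without time zone, interval)'::regprocedure;

ROLLBACK;  -- undo the catalog change, restoring the previous prorows
```

Replacing ROLLBACK with COMMIT keeps the new estimate, subject to the pg_upgrade and dump/reload caveat.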

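This is version-specific, since support functions were only introduced in v12. Before that, even the integer forms were always planned at 1000 rows (or whatever prorows was set to for that function).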

As an obvious workaround, you can trick the planner by wrapping the query in a subquery with a LIMIT:


select * FROM (
        select * from generate_series(
            (CURRENT_DATE)::timestamp without time zone,
            (CURRENT_DATE + '14 days'::interval),
            '1 day'::interval)
        LIMIT 15
        ) xxx
    ;
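To check that the planner actually picks up the limit, the same query can be prefixed with EXPLAIN. In this sketch the alias `limited_series` is only an illustrative name. The Limit node should carry an estimate of at most 15 rows, while the Function Scan beneath it still shows the function's own estimate.

```sql
EXPLAIN
SELECT * FROM (
    SELECT * FROM generate_series(
        CURRENT_DATE::timestamp without time zone,
        CURRENT_DATE + '14 days'::interval,
        '1 day'::interval)
    LIMIT 15  -- caps the row estimate seen by the outer query
) AS limited_series;
```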

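But what's the point? This is a very clear example of premature optimization. In this case the result comes back before your brain has registered that you "pressed" run.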

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

Donald Knuth, "Computer Programming as an Art", 1974.

It seems to be at least as big a problem today as it was then.