How is the `rows` value in `explain analyze` (PostgreSQL) estimated for a function scan?
I was inspecting part of the execution plan of a complex query and came up with this:
postgres=# explain analyze
select * from generate_series(
(CURRENT_DATE)::timestamp without time zone,
(CURRENT_DATE + '14 days'::interval),
'1 day'::interval)
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.01..10.01 rows=1000 width=8) (actual time=0.024..0.036 rows=15 loops=1)
Planning Time: 0.031 ms
Execution Time: 0.064 ms
(3 rows)
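AFAIK, PostgreSQL estimates the row count of a table from its reltuples value, which is understandable.
Given that the generate_series call above actually produces 15 rows, where does rows=1000 come from in the case of a function scan?
According to the documentation: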
For those interested in further details, estimation of the size of a
table (before any WHERE clauses) is done in
src/backend/optimizer/util/plancat.c. The generic logic for clause
selectivities is in src/backend/optimizer/path/clausesel.c. The
operator-specific selectivity functions are mostly found in
src/backend/utils/adt/selfuncs.c.
This is the function that computes the estimate for functions:
/*
* function_selectivity
*
* Returns the selectivity of a specified boolean function clause.
* This code executes registered procedures stored in the
* pg_proc relation, by calling the function manager.
*
* See clause_selectivity() for the meaning of the additional parameters.
*/
Selectivity
function_selectivity(PlannerInfo *root,
Oid funcid,
List *args,
Oid inputcollid,
bool is_join,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
{
It looks like this C function reads data from the pg_proc system catalog, where we have:
postgres=# select proname, prosupport, prorows
from pg_proc
where proname like '%generate%';
proname | prosupport | prorows
------------------------------+------------------------------+---------
generate_subscripts | - | 1000
generate_subscripts | - | 1000
generate_series | generate_series_int4_support | 1000
generate_series | generate_series_int4_support | 1000
generate_series_int4_support | - | 0
generate_series | generate_series_int8_support | 1000
generate_series | generate_series_int8_support | 1000
generate_series_int8_support | - | 0
generate_series | - | 1000
generate_series | - | 1000
generate_series | - | 1000
generate_series | - | 1000
(12 rows)
It looks like the pg_proc.prorows column is where the estimate comes from.
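The generate_series functions that take numeric arguments have a "support function" that looks at the arguments and tells the planner how many rows to expect. The ones that handle timestamps have no such support function. For those, the planner simply uses the function's estimated row count (prorows), which defaults to 1000.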
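Because the catalog listing in the question filters only on proname, the overloads are hard to tell apart. A minimal sketch of a query that shows each signature next to its support function and default row estimate (it assumes v12 or later, where pg_proc.prosupport exists):

```sql
-- Casting the function's oid to regprocedure prints the full argument list,
-- which makes it visible which overloads have a planner support function.
SELECT oid::regprocedure AS signature,
       prosupport,
       prorows
FROM pg_proc
WHERE proname = 'generate_series'
ORDER BY oid;
```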
If you want, you can change this estimate:
alter function generate_series(timestamp without time zone, timestamp without time zone, interval)
rows 14;
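But this change will not survive pg_upgrade, nor a dump/reload.
This is version-specific, because support functions were only implemented in v12. Before that, even the numeric forms were always planned as 1000 rows (or whatever prorows was set to for that function).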
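One way to see the difference on your own server is to compare the two plans side by side. This is only a sketch: the estimates you get depend on the server version and on whether the overload in question has a support function there.

```sql
-- Integer overload: with a support function, the estimate is derived
-- from the constant arguments.
EXPLAIN SELECT * FROM generate_series(1, 15);

-- Timestamp overload from the question: without a support function,
-- the planner falls back to prorows.
EXPLAIN SELECT * FROM generate_series(
    CURRENT_DATE::timestamp without time zone,
    CURRENT_DATE + '14 days'::interval,
    '1 day'::interval);
```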
As an obvious workaround, you can fool the planner by wrapping the query in a subquery with a LIMIT:
select * FROM (
select * from generate_series(
(CURRENT_DATE)::timestamp without time zone,
(CURRENT_DATE + '14 days'::interval),
'1 day'::interval)
LIMIT 15
) xxx
;
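What's the point? This is a very clear case of premature optimization. In this case, the result is there before your brain even registers that you "pressed" run.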
The real problem is that programmers have spent far too much time
worrying about efficiency in the wrong places and at the wrong times;
premature optimization is the root of all evil (or at least most of
it) in programming.
Donald Knuth, "Computer Programming as an Art", 1974.
It seems to be at least as true today as it was then.