Improve query efficiency for repetitive queries
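I'm writing a node.js application to enable search on a PostgreSQL database. To enable Twitter typeahead in the search box, I have to fetch a set of keywords from the database to initialize Bloodhound before the page loads. The query looks something like this: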
SELECT distinct handlerid from lotintro where char_length(lotid)=7;
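For a large table (lotintro) this is expensive. It is also wasteful, because the query result most likely stays the same for different web visitors over a period of time.
What is the proper way to handle this? I am considering a few options:
1) Put the query in a stored procedure and call it from node.js: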
SELECT * from getallhandlerid()
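Does this mean the query will be compiled, and the database will automatically return the same result set without actually running the query, knowing the result has not changed?
2) Or, create a separate table to store the distinct handlerid values and update it with a trigger that runs every day? (I know that ideally the trigger should run on every insert/update to the table, but that costs too much.)
3) Create a partial index as suggested. Here is what I gathered:
Query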
SELECT distinct handlerid from lotintro where length(lotid) = 7;
Index
CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE length(lotid) = 7;
With the index, the query takes around 250 ms:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
" -> Bitmap Heap Scan on lotintro (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Rows Removed by Index Recheck: 55285"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"
Without the index, the query takes around 210 ms:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"
What am I doing wrong?
4) Using the index and query suggested by alexius:
create index on lotintro using btree(char_length(lotid), handlerid);
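But this is not the best solution. Since there are only a few distinct values, you can use a trick called a loose index scan, which should work much faster in your case: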
explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1) -- parentheses required
UNION ALL
SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;
"CTE Scan on t (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 1"
" Buffers: shared hit=608"
" CTE t"
" -> Recursive Union (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
" Buffers: shared hit=608"
" -> Limit (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
" Buffers: shared hit=4"
" -> Index Scan using lotid_btree on lotintro lotintro_1 (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
" Index Cond: (char_length(lotid) = 7)"
" Buffers: shared hit=4"
" -> WorkTable Scan on t t_1 (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 0"
" Buffers: shared hit=604"
" SubPlan 1"
" -> Limit (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
" Buffers: shared hit=604"
" -> Index Scan using lotid_btree on lotintro (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
" Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
" Buffers: shared hit=604"
"Planning time: 1.574 ms"
**"Execution time: 25.476 ms"**
========= More info about the database ============================
dataloggerDB=# \d lotintro
Table "public.lotintro"
Column | Type | Modifiers
--------------+-----------------------------+--------------
lotstartdt | timestamp without time zone | not null
lotid | text | not null
ftc | text | not null
deviceid | text | not null
packageid | text | not null
testprogname | text | not null
testprogdir | text | not null
testgrade | text | not null
testgroup | text | not null
temperature | smallint | not null
testerid | text | not null
handlerid | text | not null
numofsite | text | not null
masknum | text |
soaktime | text |
xamsqty | smallint |
scd | text |
speedgrade | text |
loginid | text |
operatorid | text | not null
loadboardid | text | not null
checksum | text |
lotenddt | timestamp without time zone | not null
totaltest | integer | default (-1)
totalpass | integer | default (-1)
earnhour | real | default 0
avetesttime | real | default 0
Indexes:
"pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
"lotid7_idx" btree (handlerid) WHERE length(lotid) = 7
your version of Postgres, [PostgreSQL 9.2]
cardinalities (how many rows?), [411K rows for table lotintro]
percentage for length(lotid) = 7. [298350/411000= 73%]
============= After porting everything to PG 9.4 =====================
With the index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=14242"
" -> Bitmap Heap Scan on lotintro (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Heap Blocks: exact=13313"
" Buffers: shared hit=14242"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
" Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"
Without the index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=13316"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
" Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"
You need to index the exact expression used in your WHERE clause: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html
CREATE INDEX char_length_lotid_idx ON lotintro (char_length(lotid));
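You can also create a STABLE or IMMUTABLE function to wrap this query, as you suggested: http://www.postgresql.org/docs/9.4/static/sql-createfunction.html
Your last suggestion is also viable. What you are looking for are MATERIALIZED VIEWS: http://www.postgresql.org/docs/9.4/static/sql-creatematerializedview.html
This saves you from writing custom triggers to update a denormalized table.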
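As a rough sketch of the materialized-view route on 9.4: the view name lotintro_handlerids and the index name are made up for illustration, and the refresh schedule is left to whatever job scheduler you already use.

```sql
-- Hypothetical view name; the query itself is the one from the question.
CREATE MATERIALIZED VIEW lotintro_handlerids AS
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  char_length(lotid) = 7;

-- A unique index is required for REFRESH ... CONCURRENTLY (available since 9.4).
CREATE UNIQUE INDEX lotintro_handlerids_uni ON lotintro_handlerids (handlerid);

-- Run periodically (for example once a day) instead of a trigger on every write.
REFRESH MATERIALIZED VIEW CONCURRENTLY lotintro_handlerids;

-- What the node.js app would run on page load.
SELECT handlerid FROM lotintro_handlerids;
```

CONCURRENTLY keeps the view readable while it is being refreshed. A plain REFRESH is simpler but blocks readers for the duration.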
1)
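No, a function does not preserve a snapshot of the result in any way. There is some potential for performance optimization if you define the function STABLE (which would be correct). Per documentation:
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement.
IMMUTABLE would be wrong here and could cause errors.
So this can hugely benefit multiple calls within the same statement, but that does not fit your use case ...
Also, plpgsql functions work like prepared statements, which gives you a similar bonus within the same session: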
- Difference between language sql and language plpgsql in PostgreSQL functions
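For illustration, a STABLE wrapper might be declared as below. The function name getallhandlerid comes from the question; the return type and body are assumptions based on the table definition shown there.

```sql
-- Sketch only: wraps the question's query in a STABLE SQL function.
-- STABLE promises consistent results within one statement, not across calls,
-- so every call from the app still executes the query.
CREATE OR REPLACE FUNCTION getallhandlerid()
  RETURNS SETOF text
  LANGUAGE sql STABLE AS
$func$
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  char_length(lotid) = 7;
$func$;

-- Called from node.js as in the question:
SELECT * FROM getallhandlerid();
```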
2)
Try a MATERIALIZED VIEW. With or without an MV (or some other caching technique), a partial index would be most efficient for your special case:
CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE length(lotid) = 7;
Remember to include the index condition in queries that are supposed to use the index, even if that seems redundant:
- PostgreSQL does not use a partial index
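To make that concrete, here is a small sketch against the lotid7_idx definition above. As far as I know the planner compares the query's WHERE clause with the index predicate by function, so length() and char_length() are not treated as interchangeable even though they return the same value for text. Treat that as an assumption and confirm it with EXPLAIN on your own server.

```sql
-- Same expression as the index predicate: lotid7_idx is a candidate.
EXPLAIN
SELECT DISTINCT handlerid FROM lotintro WHERE length(lotid) = 7;

-- Different function name: the predicate is unlikely to be recognized,
-- so expect a sequential scan here.
EXPLAIN
SELECT DISTINCT handlerid FROM lotintro WHERE char_length(lotid) = 7;
```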
However, as you supplied:
percentage for length(lotid) = 7. [298350/411000= 73%]
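That index is only going to help if you can get an index-only scan out of it, because the condition is hardly selective. Since the table has very wide rows, index-only scans can be substantially faster.
Loose index scan
Also, rows=298350 are folded to rows=151, so a loose index scan will pay off, as I explained here: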
- Optimize GROUP BY query to retrieve latest record per user
Or in the Postgres Wiki, which is actually based on this post.
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro
WHERE length(lotid) = 7
ORDER BY 1 LIMIT 1)
UNION ALL
SELECT (SELECT handlerid FROM lotintro
WHERE length(lotid) = 7
AND handlerid > t.handlerid
ORDER BY 1 LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t
WHERE handlerid IS NOT NULL;
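This is going to be faster still in combination with the partial index I suggested. Since the partial index is only about half the size and is updated less often (depending on access patterns), it is cheaper overall.
It is faster again if you keep the table vacuumed to allow index-only scans. If you have a lot of writes, you can set more aggressive storage parameters for just this table.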
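A minimal sketch of per-table settings, assuming the autovacuum storage parameters are what is meant here. The numbers are placeholders, not recommendations, and should be tuned to the actual write rate.

```sql
-- Placeholder values: trigger autovacuum / autoanalyze after about 2 % of rows change.
ALTER TABLE lotintro SET (
    autovacuum_vacuum_scale_factor  = 0.02,
    autovacuum_analyze_scale_factor = 0.02
);

-- One manual pass so the visibility map is current right away,
-- which is what index-only scans depend on.
VACUUM ANALYZE lotintro;
```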
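Finally, you can make this faster yet with a materialized view based on this query.
Since 3/4 of the rows satisfy your condition (length(lotid) = 7), an index by itself won't help much. You might get better performance with this index, thanks to index-only scans: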
create index on lotintro using btree(char_length(lotid), handlerid);
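But this is not the best solution. Since there are only a few distinct values, you can use a trick called a loose index scan, which should work much faster in your case: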
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1) -- parentheses required
UNION ALL
SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;
For this query you also need to create the index I mentioned above.
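A sketch of that setup step follows. The index name lotid_btree is taken from the plan output in the question; running ANALYZE afterwards is my addition. The plans in the question estimate rows=2056 where 298350 rows actually match, which may mean the planner has no statistics for the expression. ANALYZE gathers statistics for indexed expressions, so the estimate should improve once the index exists.

```sql
-- Expression index matching the char_length(lotid) = 7 condition,
-- with handlerid as second column so each step of the recursive query
-- can jump straight to the next distinct value.
CREATE INDEX lotid_btree ON lotintro USING btree (char_length(lotid), handlerid);

-- Collect statistics, including for the indexed expression.
ANALYZE lotintro;
```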