PostgreSQL - Slow query joining on a VIEW
I am trying to do a simple join between a table (players) and a view (player_main_colors):
SELECT P.*, C.main_color FROM players P
OUTER LEFT JOIN player_main_colors C USING (player_id)
WHERE P.user_id=1;
This query takes about 40 ms.
Here I use a nested SELECT on the VIEW instead of the JOIN:
SELECT player_id, main_color FROM player_main_colors
WHERE player_id IN (
SELECT player_id FROM players WHERE user_id=1);
This query also takes ~40 ms.
When I split the query into two parts, it becomes as fast as I expected:
SELECT player_id FROM players WHERE user_id=1;
SELECT player_id, main_color FROM player_main_colors
where player_id in (584, 9337, 11669, 12096, 13651,
13852, 9575, 23388, 14339, 500, 24963, 25630,
8974, 13048, 11904, 10537, 20362, 9216, 4747, 25045);
These queries take about 0.5 ms each.
So why are the queries above with a JOIN or a sub-SELECT so slow, and how can I fix this?
Here are some details about my tables and the view:
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
...
)
CREATE TABLE players (
player_id INTEGER PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users (user_id),
...
)
CREATE TABLE player_data (
player_id INTEGER NOT NULL REFERENCES players (player_id),
game_id INTEGER NOT NULL,
color INTEGER NOT NULL,
PRIMARY KEY (player_id, game_id, color),
active_time INTEGER DEFAULT 0,
...
)
CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (1) player_id, color as main_color
FROM player_data
GROUP BY player_id, color
ORDER BY 1, MAX(active_time) DESC
So it seems there must be something wrong with my VIEW...?
Here is the EXPLAIN ANALYZE for the nested SELECT query above:
Merge Semi Join (cost=1877.59..2118.00 rows=6851 width=8) (actual time=32.946..38.471 rows=25 loops=1)
  Merge Cond: (player_data.player_id = players.player_id)
  -> Unique (cost=1733.19..1801.70 rows=13701 width=12) (actual time=32.651..37.209 rows=13419 loops=1)
       -> Sort (cost=1733.19..1767.45 rows=13701 width=12) (actual time=32.646..34.918 rows=16989 loops=1)
            Sort Key: player_data.player_id, (max(player_data.active_time))
            Sort Method: external merge Disk: 376kB
            -> HashAggregate (cost=654.79..791.80 rows=13701 width=12) (actual time=13.636..19.051 rows=17311 loops=1)
                 -> Seq Scan on player_data (cost=0.00..513.45 rows=18845 width=12) (actual time=0.005..1.801 rows=18845 loops=1)
  -> Sort (cost=144.40..144.53 rows=54 width=8) (actual time=0.226..0.230 rows=54 loops=1)
       Sort Key: players.player_id
       Sort Method: quicksort Memory: 19kB
       -> Bitmap Heap Scan on players (cost=4.67..142.85 rows=54 width=8) (actual time=0.035..0.112 rows=54 loops=1)
            Recheck Cond: (user_id = 1)
            -> Bitmap Index Scan on test (cost=0.00..4.66 rows=54 width=0) (actual time=0.023..0.023 rows=54 loops=1)
                 Index Cond: (user_id = 1)
Total runtime: 39.279 ms
As for indexes, apart from the default indexes on my primary keys, I have only one relevant index:
CREATE INDEX player_user_idx ON players (user_id);
I am currently using PostgreSQL 9.2.9.
Update:
I have reduced the problem to the example below. Note the difference between IN (4747) and IN (SELECT 4747).
Slow:
>> explain analyze SELECT * FROM (
SELECT player_id, color
FROM player_data
GROUP BY player_id, color
ORDER BY MAX(active_time) DESC
) S
WHERE player_id IN (SELECT 4747);
Hash Join (cost=1749.99..1975.37 rows=6914 width=8) (actual time=30.492..34.291 rows=4 loops=1)
  Hash Cond: (player_data.player_id = (4747))
  -> Sort (cost=1749.95..1784.51 rows=13827 width=12) (actual time=30.391..32.655 rows=17464 loops=1)
       Sort Key: (max(player_data.active_time))
       Sort Method: external merge Disk: 376kB
       -> HashAggregate (cost=660.71..798.98 rows=13827 width=12) (actual time=12.714..17.249 rows=17464 loops=1)
            -> Seq Scan on player_data (cost=0.00..518.12 rows=19012 width=12) (actual time=0.006..1.898 rows=19012 loops=1)
  -> Hash (cost=0.03..0.03 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
       Buckets: 1024 Batches: 1 Memory Usage: 1kB
       -> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
            -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Total runtime: 35.015 ms
(12 rows)
Time: 35.617 ms
Fast:
>> explain analyze SELECT * FROM (
SELECT player_id, color
FROM player_data
GROUP BY player_id, color
ORDER BY MAX(active_time) DESC
) S
WHERE player_id IN (4747);
Subquery Scan on s (cost=17.40..17.45 rows=4 width=8) (actual time=0.035..0.035 rows=4 loops=1)
  -> Sort (cost=17.40..17.41 rows=4 width=12) (actual time=0.034..0.034 rows=4 loops=1)
       Sort Key: (max(player_data.active_time))
       Sort Method: quicksort Memory: 17kB
       -> GroupAggregate (cost=0.00..17.36 rows=4 width=12) (actual time=0.020..0.027 rows=4 loops=1)
            -> Index Scan using player_data_pkey on player_data (cost=0.00..17.28 rows=5 width=12) (actual time=0.014..0.019 rows=5 loops=1)
                 Index Cond: (player_id = 4747)
Total runtime: 0.080 ms
(8 rows)
Time: 0.610 ms
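The reason for this behavior is that the query planner has limitations. When the query contains specific literal values, the planner can build a specific plan based on what it can see and analyze. When the values arrive through joins and subselects, it has much less visibility into what is going to happen, so the optimizer falls back to a more "generic" plan, which in this case is significantly slower. The two plans in the question show this: with IN (4747) the condition is pushed down into the subquery and served by an index scan on player_data_pkey, while with IN (SELECT 4747) all of player_data is scanned, aggregated, and sorted before the join.
The right answer for you seems to be to run two selects. Perhaps a better answer is to denormalize "main_color" onto your players table and update it periodically.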
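A minimal sketch of that denormalization, assuming a new nullable column on players (the column name main_color is an illustrative choice, not something from the question) that is refreshed from the existing view by a periodic job:

```sql
-- Hypothetical column; the name is an assumption for illustration.
ALTER TABLE players ADD COLUMN main_color integer;

-- Periodic refresh: copy the current main color from the view.
-- IS DISTINCT FROM skips rows whose value has not changed.
UPDATE players p
SET    main_color = c.main_color
FROM   player_main_colors c
WHERE  c.player_id = p.player_id
AND    p.main_color IS DISTINCT FROM c.main_color;

-- The lookup then no longer touches the view at all.
SELECT * FROM players WHERE user_id = 1;
```

The refresh still pays the full cost of the view, but only once per run rather than once per lookup. Players without any rows in player_data keep a NULL main_color, and the value can be stale between refreshes.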
Your VIEW definition has both GROUP BY and DISTINCT ON. That is like shooting a dead man. Simplify:
CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (1)
player_id, color AS main_color
FROM player_data
ORDER BY 1, active_time DESC NULLS LAST;
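NULLS LAST is necessary to be equivalent to your original, since active_time can be NULL according to your table definition. This should be faster. But there is more. For best performance, create these indexes: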
CREATE INDEX players_up_idx ON players (user_id, player_id);
CREATE INDEX players_data_pa_idx ON player_data
(player_id, active_time DESC NULLS LAST, color);
Use DESC NULLS LAST in the index as well, to match the sort order of the query. Or you change player_data.active_time to NOT NULL and simplify everything.
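The NOT NULL variant could be sketched roughly as follows. It assumes that replacing existing NULL values with 0 (the column default) is acceptable for your data, which the question does not confirm, and the index name is made up for illustration:

```sql
-- Assumption: treating missing active_time as 0 is acceptable.
-- SET NOT NULL fails if any NULL values remain.
UPDATE player_data SET active_time = 0 WHERE active_time IS NULL;
ALTER TABLE player_data ALTER COLUMN active_time SET NOT NULL;

-- With no NULLs possible, plain DESC is enough in the view ...
CREATE OR REPLACE VIEW player_main_colors AS
SELECT DISTINCT ON (1)
       player_id, color AS main_color
FROM   player_data
ORDER  BY 1, active_time DESC;

-- ... and in the matching index (hypothetical name).
CREATE INDEX player_data_pa_simple_idx ON player_data
       (player_id, active_time DESC, color);
```

If you go this way, the index with DESC NULLS LAST is not needed in addition; one of the two is enough.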
By the way, it is LEFT OUTER JOIN, not OUTER LEFT JOIN. Or just omit the noise word OUTER:
SELECT * -- equivalent here to "p.*, c.main_color"
FROM players p
LEFT JOIN player_main_colors c USING (player_id)
WHERE p.user_id = 1;
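I assume there are many rows in player_data for each player_id, and you are selecting only a few player_id values. JOIN LATERAL would be fastest for this case, but you need Postgres 9.3 or later for that. In pg 9.2 you can achieve a similar effect with correlated subqueries: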
CREATE VIEW player_main_colors AS
SELECT player_id
, (SELECT color
FROM player_data
WHERE player_id = p.player_id
ORDER BY active_time DESC NULLS LAST
LIMIT 1) AS main_color
FROM players p
ORDER BY 1 -- optional
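A subtle difference from your original view: this one includes players that have no entries in player_data. You can try the same query as above based on the new view. But I would not use a view at all. This is probably fastest: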
SELECT *
, (SELECT color
FROM player_data
WHERE player_id = p.player_id
ORDER BY active_time DESC NULLS LAST
LIMIT 1) AS main_color
FROM players p
WHERE p.user_id = 1;
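For Postgres 9.3 or later, the JOIN LATERAL form mentioned above might look like the sketch below. It is untested against your data; the aliases d and c are arbitrary, and LEFT JOIN ... ON true is used so that players without rows in player_data are kept, as in the correlated-subquery version:

```sql
-- Requires Postgres 9.3+ (LATERAL is not available in 9.2).
SELECT p.*, c.main_color
FROM   players p
LEFT   JOIN LATERAL (
   SELECT d.color AS main_color
   FROM   player_data d
   WHERE  d.player_id = p.player_id
   ORDER  BY d.active_time DESC NULLS LAST
   LIMIT  1
   ) c ON true
WHERE  p.user_id = 1;
```

Like the correlated subquery, this runs one small lookup per selected player, which the index on (player_id, active_time DESC NULLS LAST, color) can serve directly.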
Detailed explanation:
- Optimize GROUP BY query to retrieve latest record per user