PostgreSQL extremely slow on Where and Group comparing to MS SQL
After five days of trying to solve a performance problem with our database on PostgreSQL, I decided to ask for help here! A week ago we decided to try moving a database with 60M records from MSSQL to PostgreSQL, and the SQL below is very slow on PostgreSQL:
set random_page_cost=1;
set seq_page_cost=5;
set enable_seqscan=on;
set work_mem = '100MB';
SELECT
DATE("DateStamp"), "Result", Count(*), Sum("ConversionCost")
FROM
"Log"
WHERE
"UserId" = 7841 AND "DateStamp" > '2019-01-01' AND "DateStamp" < '2020-02-26'
GROUP BY
1,2
Execution plan:
Finalize GroupAggregate  (cost=1332160.59..1726394.02 rows=3093547 width=21) (actual time=2929.936..3157.049 rows=714 loops=1)
  Output: (date("DateStamp")), "Result", count(*), sum("ConversionCost")
  Group Key: (date("Log"."DateStamp")), "Log"."Result"
  Buffers: shared hit=2292 read=345810
  ->  Gather Merge  (cost=1332160.59..1661945.12 rows=2577956 width=21) (actual time=2929.783..3156.616 rows=2037 loops=1)
        Output: (date("DateStamp")), "Result", (PARTIAL count(*)), (PARTIAL sum("ConversionCost"))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=6172 read=857125
        ->  Partial GroupAggregate  (cost=1331160.56..1363385.01 rows=1288978 width=21) (actual time=2906.450..3089.056 rows=679 loops=3)
              Output: (date("DateStamp")), "Result", PARTIAL count(*), PARTIAL sum("ConversionCost")
              Group Key: (date("Log"."DateStamp")), "Log"."Result"
              Buffers: shared hit=6172 read=857125
              Worker 0: actual time=2895.531..3058.852 rows=675 loops=1
                Buffers: shared hit=1930 read=255687
              Worker 1: actual time=2894.513..3052.916 rows=673 loops=1
                Buffers: shared hit=1950 read=255628
              ->  Sort  (cost=1331160.56..1334383.01 rows=1288978 width=9) (actual time=2906.435..2968.562 rows=1064916 loops=3)
                    Output: (date("DateStamp")), "Result", "ConversionCost"
                    Sort Key: (date("Log"."DateStamp")), "Log"."Result"
                    Sort Method: quicksort  Memory: 94807kB
                    Worker 0:  Sort Method: quicksort  Memory: 69171kB
                    Worker 1:  Sort Method: quicksort  Memory: 69063kB
                    Buffers: shared hit=6172 read=857125
                    Worker 0: actual time=2895.518..2951.406 rows=951356 loops=1
                      Buffers: shared hit=1930 read=255687
                    Worker 1: actual time=2894.494..2947.892 rows=949038 loops=1
                      Buffers: shared hit=1950 read=255628
                    ->  Parallel Index Scan using "IX_Log_UserId" on public."Log"  (cost=0.56..1200343.50 rows=1288978 width=9) (actual time=0.087..2634.603 rows=1064916 loops=3)
                          Output: date("DateStamp"), "Result", "ConversionCost"
                          Index Cond: ("Log"."UserId" = 7841)
                          Filter: (("Log"."DateStamp" > '2019-01-01 00:00:00'::timestamp without time zone) AND ("Log"."DateStamp" < '2020-02-26 00:00:00'::timestamp without time zone))
                          Buffers: shared hit=6144 read=857123
                          Worker 0: actual time=0.077..2653.065 rows=951356 loops=1
                            Buffers: shared hit=1917 read=255685
                          Worker 1: actual time=0.107..2654.640 rows=949038 loops=1
                            Buffers: shared hit=1935 read=255628
Planning Time: 0.330 ms
Execution Time: 3163.850 ms
Execution plan URL: https://explain.depesz.com/s/zLNI
The same SQL takes less than 2 seconds on MSSQL, but up to 10 seconds on PostgreSQL. The Log table contains roughly 60M records, and the WHERE clause "UserId" = 7841 AND "DateStamp" > '2019-01-01' AND "DateStamp" < '2020-02-26' filters it down to roughly 3M records.
The table structure is as follows:
create table "Log"
(
"Id" integer generated by default as identity
constraint "PK_Log"
primary key,
"Result" boolean not null,
"DateStamp" timestamp not null,
"ConversionCost" integer not null,
"UserId" integer not null
constraint "FK_Log_User_UserId"
references "User"
on delete cascade
);
create index "IX_Log_ConversionCost"
on "Log" ("ConversionCost");
create index "IX_Log_DateStamp"
on "Log" ("DateStamp");
create index "IX_Log_Result"
on "Log" ("Result");
create index "IX_Log_UserId"
on "Log" ("UserId");
The PostgreSQL server has 6 CPUs and 16 GB of RAM, compared to our old MSSQL server with 2 CPUs and 8 GB of RAM. As you can see, PostgreSQL has more computing resources, yet it performs worse. Both servers have SSDs.
Could the problem simply be that PostgreSQL is less advanced than MSSQL performance-wise, and nothing can be done here?
You can rewrite the query as:
SELECT
DATE("DateStamp"), "Result", Count(*), Sum("ConversionCost")
FROM "Log"
WHERE "UserId" = 7841
AND "DateStamp" >= '2019-01-02'
AND "DateStamp" < '2020-02-26'
GROUP BY 1,2
Then the query would benefit greatly from this index:
create index "IX_Log_UserId" on "Log" ("UserId", "DateStamp");
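With a composite index like that, the planner can apply both the equality on "UserId" and the date range inside the Index Cond, instead of reading every row for the user and discarding most of them in a separate Filter step, as the plan above shows. A sketch of how to verify this (same table and values as in the question):

```sql
-- Re-check the plan after creating the composite index; the date range
-- should now appear under "Index Cond" rather than under "Filter".
EXPLAIN (ANALYZE, BUFFERS)
SELECT DATE("DateStamp"), "Result", Count(*), Sum("ConversionCost")
FROM "Log"
WHERE "UserId" = 7841
  AND "DateStamp" >= '2019-01-02'
  AND "DateStamp" < '2020-02-26'
GROUP BY 1, 2;
```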
For even better performance, you can create a covering index:
create index "IX_Log_UserId" on "Log" (
"UserId",
"DateStamp",
"Result",
"ConversionCost"
);
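On PostgreSQL 11 and later, an alternative is to keep the key columns narrow and attach the extra columns with INCLUDE, which stores them only in the leaf pages while still enabling index-only scans. A sketch, assuming the same table (the index name here is just illustrative; run VACUUM afterwards so the visibility map is up to date and index-only scans actually kick in):

```sql
-- Covering index variant (PostgreSQL 11+): "Result" and "ConversionCost"
-- are payload-only columns, so the B-tree keys stay small.
create index "IX_Log_UserId_Covering"
    on "Log" ("UserId", "DateStamp")
    include ("Result", "ConversionCost");
```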