Sql: Avoiding external merge from group by in postgres

I have a fairly simple query that takes the top ten scores from a large table (just under 10 million rows) and returns them in descending order. The score is a sum aggregated with a group by clause, and that group by appears to be particularly expensive, using Sort Method: external merge Disk: 190080kB.

Is there any way to speed this up? I already have indexes on user.test_id and on user.score (descending). I'd rather not change work_mem, since I have limited control over the postgres settings.
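
For reference, the existing indexes look roughly like this (a sketch; the index names are illustrative, and the table/column names are as written elsewhere in this question):

-- sketch of the indexes already in place; names are illustrative
-- "user" is the (anonymised) table name used throughout this question;
-- quote it if it clashes with the reserved word in your setup
create index user_test_id_idx on user (test_id);
create index user_score_desc_idx on user (score desc);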

Query:

select
    (select test from test where top_scores.test_id = test.id),
    (select type from test where top_scores.test_id = test.id),
    sum_score
from (select sum(score) as sum_score,
             test_id
      from user
      group by test_id
      order by sum_score desc
      limit 10
     ) top_scores

Query plan:

Subquery Scan on top_scores  (cost=1412662.62..1412831.69 rows=10 width=16) (actual time=164098.107..164098.714 rows=10 loops=1)
  ->  Limit  (cost=1412662.62..1412662.64 rows=10 width=16) (actual time=164098.042..164098.144 rows=10 loops=1)
        ->  Sort  (cost=1412662.62..1419366.96 rows=2681736 width=16) (actual time=164098.033..164098.067 rows=10 loops=1)
              Sort Key: (sum(user.score)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              ->  GroupAggregate  (cost=1271799.65..1354711.27 rows=2681736 width=16) (actual time=72815.313..152605.093 rows=2499234 loops=1)
                    Group Key: user.test_id
                    ->  Sort  (cost=1271799.65..1290497.74 rows=7479234 width=16) (actual time=72815.273..107823.507 rows=7479234 loops=1)
                          Sort Key: user.test_id
                          Sort Method: external merge  Disk: 190080kB
                          ->  Seq Scan on user  (cost=0.00..162238.34 rows=7479234 width=16) (actual time=0.009..33795.669 rows=7479234 loops=1)
  SubPlan 1
    ->  Index Scan using test_id_idx on test  (cost=0.43..8.45 rows=1 width=14) (actual time=0.012..0.016 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
  SubPlan 2
    ->  Index Scan using test_id_idx on test test_1  (cost=0.43..8.45 rows=1 width=3) (actual time=0.006..0.010 rows=1 loops=10)
          Index Cond: (top_scores.test_id = id)
Planning time: 0.724 ms
Execution time: 164135.458 ms

Following the suggestion from @jjanes in his answer, I tried creating the following indexes:

create index user_score_test_id_idx on user (score, test_id); 
create index user_test_id_score_idx on user (test_id, score);
create index user_test_id_score_desc_idx on user (test_id, score desc nulls last);
create index user_score_desc_test_id_idx on user (score desc nulls last, test_id);

and ran a VACUUM FULL on user.

This had no noticeable effect on execution time, and the resulting query plan was exactly the same as without them. (I put the plans through a diff checker; the only differences were the timings.)

Edit: apparently VACUUM FULL is not what I wanted. A plain VACUUM is all that's needed.
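
For completeness, the second pass looked roughly like this (a plain VACUUM, optionally with ANALYZE to refresh statistics, instead of VACUUM FULL):

-- plain VACUUM sets the visibility map bits that index-only scans need;
-- ANALYZE refreshes the planner statistics at the same time
vacuum analyze user;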

Tried:

select
    test, type, sum_score
from 
    test 
join
    (select 
        sum(score) as sum_score, 
        test_id
     from 
        user
     group by 
        test_id
     order by 
        sum_score desc
     limit 10
    ) top_scores
on
   test.id = top_scores.test_id

Answer from @jjanes:

I wouldn't expect the separate indexes on test_id and on score to help here. But a multicolumn index on (test_id, score) should be able to use an index-only scan, and so avoid the sort. If it doesn't help immediately, then VACUUM the table to set the visibility map bits.
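
Roughly, something like the following (a sketch of the index-only-scan idea rather than a tested fix; the index name is arbitrary, and the table/column names are taken from the question):

-- multicolumn index so the aggregate can be fed by an index-only scan
create index user_test_id_score_idx on user (test_id, score);

-- plain VACUUM (not VACUUM FULL) sets the visibility map bits
-- that index-only scans depend on
vacuum user;

-- the inner part of the plan should then show something like
-- "Index Only Scan using user_test_id_score_idx on user"
explain (analyze, buffers)
select test_id, sum(score) as sum_score
from user
group by test_id
order by sum_score desc
limit 10;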

Also, your hardware seems to be very poor, or perhaps just extremely overloaded.