GHC 7.10 生成的代码比旧版本慢

GHC 7.10 generates slower code than older versions

我意识到最新版本的 GHC (7.10.3) 生成的代码比旧版本慢得多。我目前的当前版本:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

我的本地计算机上还安装了另外两个旧版本。

我的测试代码取自herecollatz1.hs代码):

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

使用 GHC 7.4、7.6 和 7.10 编译会产生以下时间:

$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.879s
user    0m1.876s
sys     0m0.000s
$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.901s
user    0m1.896s
sys     0m0.000s
$ ~/Tools/ghc/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m10.562s
user    0m10.528s
sys     0m0.036s    

毫无疑问,最新版本的 GHC 生成的代码比前两个版本更差。我无法重现与博客相同的效率,但可能是因为我没有 LLVM 并且我没有作者使用的确切版本。不过,我相信结论是显而易见的。

我的问题是,总的来说,为什么会发生这种情况? GHC 不知何故变得比以前更糟。具体来说,如果我想调查,我应该如何开始?

这是两个配置文件的比较 (diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof)

1c1                               
<       Fri Mar 11 19:58 2016 Time and Allocation Profiling Report  (Final)
---                               
>       Fri Mar 11 19:59 2016 Time and Allocation Profiling Report  (Final)
5,6c5,6                               
<       total time  =        2.40 secs   (2400 ticks @ 1000 us, 1 processor)
<       total alloc = 256,066,744 bytes  (excludes profiling overheads)
---                               
>       total time  =       10.89 secs   (10895 ticks @ 1000 us, 1 processor)
>       total alloc = 15,713,590,808 bytes  (excludes profiling overheads)
10,13c10,12                               
< lenIterWhile.len Main     93.8   0.0                    
< collatzMax       Main      2.2   93.7
< collatzNext      Main      2.0    0.0
< lenIterWhile     Main      1.5    6.2
---                                
> collatzNext      Main     79.6   89.4
> lenIterWhile.len Main     18.9    8.8
> collatzMax       Main      0.8    1.5

发生了一些非常奇怪的事情。虽然在 GHC 中 lenIterWhile.len 占用了大部分时间,但 collatzNext 现在是罪魁祸首。让我们看看转储的核心:

-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
  :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
  \ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
    case ww_s4Mn of wild_XQ {
      __DEFAULT ->
        case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
          __DEFAULT ->
            Main.$wlen
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s4Mr 1);
          __word 0 ->
            Main.$wlen
              (GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
        };
      __word 1 -> ww1_s4Mr
    }
end Rec }

似乎或多或少是合理的。现在关于 GHC 7.10.3:

Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
  \ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
    case ww_s60s of wild_X1Z {
      __DEFAULT ->
        case even
               @ Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1Z)
        of _ [Occ=Dead] {
          False ->
            $wlen_r6Sy
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s60w 1);
          True ->
            $wlen_r6Sy
              (GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
        };
      __word 1 -> ww1_s60w
    }
end Rec }

好吧,好像是一样的。 even的调用除外。让我们将 even 替换为 Integral 的一种内联变体,例如x <code>rem 2 == 0:

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- rest of code the same

让我们再次编译它并进行分析和检查:

$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS 
(329,837799)
     416,119,240 bytes allocated in the heap
          69,760 bytes copied during GC
          59,368 bytes maximum residency (2 sample(s))
          21,912 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       800 colls,     0 par    0.000s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0002s    0.0003s

  INIT    time    0.000s  (  0.019s elapsed)
  MUT     time    2.500s  (  2.546s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.500s  (  2.567s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    166,447,696 bytes per MUT second

  Productivity 100.0% of total user, 97.4% of total elapsed

$ cat Test.prof
        Fri Mar 11 20:22 2016 Time and Allocation Profiling Report  (Final)

           Test.exe +RTS -s -p -RTS 1000000

        total time  =        2.54 secs   (2535 ticks @ 1000 us, 1 processor)
        total alloc = 256,066,984 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

lenIterWhile.len Main     94.4    0.0
main             Main      1.9   93.7
collatzNext      Main      1.8    0.0
lenIterWhile     Main      1.3    6.2

                                                                   individual     inherited
COST CENTRE           MODULE                     no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                        44           0    0.0    0.0   100.0  100.0
 main                 Main                        89           0    1.9   93.7   100.0  100.0
  main.\              Main                        92     1000000    0.4    0.0    98.1    6.2
   collatzLen         Main                        93     1000000    0.2    0.0    97.8    6.2
    lenIterWhile      Main                        94     1000000    1.3    6.2    97.5    6.2
     lenIterWhile.len Main                        95    88826840   94.4    0.0    96.2    0.0
      collatzNext     Main                        96    87826840    1.8    0.0     1.8    0.0
  main.max_a0         Main                        90           1    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.CodePage    73           0    0.0    0.0     0.0    0.0
 CAF                  System.Environment          64           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.Text          62           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding             61           0    0.0    0.0     0.0    0.0

似乎已经解决了。所以问题是 GHC-7.8 内联 even,而 GHC-7.10 没有。发生这种情况是因为为 IntInteger 添加了 {-# SPECIALISE even :: x -> x -> Bool #-} 规则,这些规则不允许内联。

issue's discussion documents making even and odd {-# INLINEABLE ... #-} would resolve this issue. Note that the specialisation itself was added for perfomance reasons.