GHC 7.10 generates slower code than older versions
I noticed that the latest version of GHC (7.10.3) generates significantly slower code than older versions. My current version:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
I also have two older versions installed on my local machine.
My test code is taken from here (the collatz1.hs code):
import Data.Word
import Data.List
import System.Environment
collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2
-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0
lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code
main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]
Compiling with GHC 7.4, 7.6 and 7.10 produces the following timings:
$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m1.879s
user 0m1.876s
sys 0m0.000s
$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m1.901s
user 0m1.896s
sys 0m0.000s
$ ~/Tools/ghc/bin/ghc -O2 Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
$ time ./Test 1000000
(329,837799)
real 0m10.562s
user 0m10.528s
sys 0m0.036s
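Clearly the latest GHC produces worse code than the two older versions. I cannot reproduce the efficiency reported in the blog post, but that is probably because I don't have LLVM and am not using the exact version the author used. Still, I believe the conclusion is obvious.
My question is: in general, why might this happen? Somehow GHC has become worse than before. And specifically, if I want to investigate, where should I start?
Here is a comparison of the two profiles (diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof):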
1c1
< Fri Mar 11 19:58 2016 Time and Allocation Profiling Report (Final)
---
> Fri Mar 11 19:59 2016 Time and Allocation Profiling Report (Final)
5,6c5,6
< total time = 2.40 secs (2400 ticks @ 1000 us, 1 processor)
< total alloc = 256,066,744 bytes (excludes profiling overheads)
---
> total time = 10.89 secs (10895 ticks @ 1000 us, 1 processor)
> total alloc = 15,713,590,808 bytes (excludes profiling overheads)
10,13c10,12
< lenIterWhile.len Main 93.8 0.0
< collatzMax Main 2.2 93.7
< collatzNext Main 2.0 0.0
< lenIterWhile Main 1.5 6.2
---
> collatzNext Main 79.6 89.4
> lenIterWhile.len Main 18.9 8.8
> collatzMax Main 0.8 1.5
Something very odd is going on. Whereas in GHC 7.8.4 lenIterWhile.len took most of the time, collatzNext is now the culprit. Let's look at the dumped Core:
-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
:: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
\ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
case ww_s4Mn of wild_XQ {
__DEFAULT ->
case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
__DEFAULT ->
Main.$wlen
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s4Mr 1);
__word 0 ->
Main.$wlen
(GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
};
__word 1 -> ww1_s4Mr
}
end Rec }
That seems more or less reasonable. Now for GHC 7.10.3:
Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
\ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
case ww_s60s of wild_X1Z {
__DEFAULT ->
case even
@ Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1Z)
of _ [Occ=Dead] {
False ->
$wlen_r6Sy
(GHC.Prim.quotWord#
(GHC.Prim.narrow32Word#
(GHC.Prim.plusWord#
(GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
(__word 1)))
(__word 2))
(GHC.Prim.+# ww1_s60w 1);
True ->
$wlen_r6Sy
(GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
};
__word 1 -> ww1_s60w
}
end Rec }
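For anyone who wants to produce listings like the two above, one option is to request the simplifier output from inside the module itself. This is only a sketch and not necessarily how the dumps shown here were generated. -ddump-simpl and -dsuppress-coercions are standard GHC flags; the program is the same test program, with type signatures added and the local names maxA0 and a chosen here for readability.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -dsuppress-coercions #-}
-- Compiling this file prints the optimised Core during compilation.
import Data.Word (Word32)
import System.Environment (getArgs)

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

collatzLen :: Word32 -> Int
collatzLen = lenIterWhile collatzNext (/= 1)

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0
  where
    -- Count how many applications of 'next' it takes until 'notDone' fails.
    len n m = if notDone n then len (next n) (m + 1) else m

main :: IO ()
main = do
  [a0] <- getArgs
  let maxA0 = read a0 :: Word32
  print $ maximum $ map (\a -> (collatzLen a, a)) [1 .. maxA0]
```

In the output, the thing to look for is the worker for len ($wlen in the listings above) and whether its body still contains a call to even.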
Well, it looks the same, except for the call to even. Let's replace even with one of the inlined variants from Integral, e.g. x `rem` 2 == 0:
import Data.Word
import Data.List
import System.Environment
collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2
-- rest of code the same
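Since even is defined in terms of rem, the rewrite should not change results, and that is easy to confirm before timing anything. The sketch below is illustrative only: collatzNextEven, collatzNextRem and agreeUpTo are hypothetical names introduced here, not part of the original program.

```haskell
import Data.Word (Word32)

-- Original definition, using the class-polymorphic 'even'.
collatzNextEven :: Word32 -> Word32
collatzNextEven a = (if even a then a else 3*a+1) `div` 2

-- Variant with the parity test written out via 'rem'.
collatzNextRem :: Word32 -> Word32
collatzNextRem a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- True when both variants agree on every input in [1..n].
agreeUpTo :: Word32 -> Bool
agreeUpTo n = all (\a -> collatzNextEven a == collatzNextRem a) [1 .. n]

main :: IO ()
main = print (agreeUpTo 1000000)
```

If this reports agreement, any timing difference between the two versions comes from how the parity test is compiled, not from what it computes.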
Let's compile it again with profiling enabled and check:
$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS
(329,837799)
416,119,240 bytes allocated in the heap
69,760 bytes copied during GC
59,368 bytes maximum residency (2 sample(s))
21,912 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 800 colls, 0 par 0.000s 0.002s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.000s ( 0.019s elapsed)
MUT time 2.500s ( 2.546s elapsed)
GC time 0.000s ( 0.003s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 2.500s ( 2.567s elapsed)
%GC time 0.0% (0.1% elapsed)
Alloc rate 166,447,696 bytes per MUT second
Productivity 100.0% of total user, 97.4% of total elapsed
$ cat Test.prof
Fri Mar 11 20:22 2016 Time and Allocation Profiling Report (Final)
Test.exe +RTS -s -p -RTS 1000000
total time = 2.54 secs (2535 ticks @ 1000 us, 1 processor)
total alloc = 256,066,984 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
lenIterWhile.len Main 94.4 0.0
main Main 1.9 93.7
collatzNext Main 1.8 0.0
lenIterWhile Main 1.3 6.2
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 44 0 0.0 0.0 100.0 100.0
main Main 89 0 1.9 93.7 100.0 100.0
main.\ Main 92 1000000 0.4 0.0 98.1 6.2
collatzLen Main 93 1000000 0.2 0.0 97.8 6.2
lenIterWhile Main 94 1000000 1.3 6.2 97.5 6.2
lenIterWhile.len Main 95 88826840 94.4 0.0 96.2 0.0
collatzNext Main 96 87826840 1.8 0.0 1.8 0.0
main.max_a0 Main 90 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.CodePage 73 0 0.0 0.0 0.0 0.0
CAF System.Environment 64 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Text 62 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 61 0 0.0 0.0 0.0 0.0
That seems to have fixed it. So the problem is that GHC 7.8 inlines even, whereas GHC 7.10 does not. This happens because {-# SPECIALISE even :: x -> Bool #-} rules were added for Int and Integer, and these prevent inlining.
As the issue's discussion documents, making even and odd {-# INLINEABLE ... #-} would resolve this issue. Note that the specialisation itself was added for performance reasons.
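Until that change lands in base, a possible workaround in user code is to define the parity test locally so that GHC sees its unfolding at Word32. This is a sketch under the assumption that the missing inlining is the only problem; evenW32 is a hypothetical helper name, not something provided by base, and whether it recovers the 7.8 timings would need to be measured.

```haskell
import Data.Word (Word32)

-- Hypothetical local replacement for 'even', monomorphic at Word32.
-- The INLINE pragma asks GHC to unfold it at each call site.
evenW32 :: Word32 -> Bool
evenW32 n = n `rem` 2 == 0
{-# INLINE evenW32 #-}

collatzNext :: Word32 -> Word32
collatzNext a = (if evenW32 a then a else 3*a+1) `div` 2

main :: IO ()
main = print (map collatzNext [1 .. 6])
```

This is the same idea as writing a `rem` 2 == 0 inline, just kept behind a name so the call sites stay readable.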