为什么 `-threaded` 会变慢?
Why does `-threaded` make it slower?
一个简单的计划:
import qualified Data.ByteString.Lazy.Char8 as BS
main = do
wc <- length . BS.words <$> BS.getContents
print wc
构建速度:
ghc -fllvm -O2 -threaded -rtsopts Words.hs
更多 CPU 意味着更慢?
$ time ./Words +RTS -qa -N1 < big.txt
331041862
real 0m25.963s
user 0m21.747s
sys 0m1.528s
$ time ./Words +RTS -qa -N2 < big.txt
331041862
real 0m36.410s
user 0m34.910s
sys 0m6.892s
$ time ./Words +RTS -qa -N4 < big.txt
331041862
real 0m42.150s
user 0m55.393s
sys 0m16.227s
好的衡量标准:
$time wc -w big.txt
331041862 big.txt
real 0m8.277s
user 0m7.553s
sys 0m0.529s
很明显,这是一个单线程activity。不过,我想知道为什么它会慢这么多。
另外,你有什么建议,我怎样才能让它与 wc 竞争?
这是GC。用 +RTS -s
执行了你的程序,结果说明了一切。
-N1
D:\>a +RTS -qa -N1 -s < lorem.txt
15470835
4,558,095,152 bytes allocated in the heap
1,746,720 bytes copied during GC
77,936 bytes maximum residency (118 sample(s))
131,856 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8519 colls, 0 par 0.016s 0.021s 0.0000s 0.0001s
Gen 1 118 colls, 0 par 0.000s 0.004s 0.0000s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.842s ( 0.855s elapsed)
GC time 0.016s ( 0.025s elapsed)
EXIT time 0.016s ( 0.000s elapsed)
Total time 0.874s ( 0.881s elapsed)
Alloc rate 5,410,809,512 bytes per MUT second
Productivity 98.2% of total user, 97.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N4
D:\>a +RTS -qa -N4 -s < lorem.txt
15470835
4,558,093,352 bytes allocated in the heap
1,720,232 bytes copied during GC
77,936 bytes maximum residency (113 sample(s))
160,432 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
Gen 1 113 colls, 112 par 0.031s 0.027s 0.0002s 0.0099s
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.950s ( 1.415s elapsed)
GC time 4.774s ( 1.705s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 6.724s ( 3.121s elapsed)
Alloc rate 2,337,468,786 bytes per MUT second
Productivity 29.0% of total user, 62.5% of total elapsed
gc_alloc_block_sync: 21082
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
最重要的部分是
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
和
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
当-threaded
开关打开时,在运行时ghc会尽可能地平衡线程之间的任何工作。您的整个程序是一个顺序过程,因此唯一可以转移到其他线程的工作是 GC,而您的程序实际上不能并行进行 GC,因此这些线程等待彼此完成他们的工作,导致大量时间浪费在同步上.
如果您通过 +RTS -qm
告诉运行时不要在线程之间进行平衡,那么有时 -N4 与 -N1 一样快。
一个简单的计划:
import qualified Data.ByteString.Lazy.Char8 as BS
main = do
wc <- length . BS.words <$> BS.getContents
print wc
构建速度:
ghc -fllvm -O2 -threaded -rtsopts Words.hs
更多 CPU 意味着更慢?
$ time ./Words +RTS -qa -N1 < big.txt
331041862
real 0m25.963s
user 0m21.747s
sys 0m1.528s
$ time ./Words +RTS -qa -N2 < big.txt
331041862
real 0m36.410s
user 0m34.910s
sys 0m6.892s
$ time ./Words +RTS -qa -N4 < big.txt
331041862
real 0m42.150s
user 0m55.393s
sys 0m16.227s
好的衡量标准:
$time wc -w big.txt
331041862 big.txt
real 0m8.277s
user 0m7.553s
sys 0m0.529s
很明显,这是一个单线程activity。不过,我想知道为什么它会慢这么多。
另外,你有什么建议,我怎样才能让它与 wc 竞争?
这是GC。用 +RTS -s
执行了你的程序,结果说明了一切。
-N1
D:\>a +RTS -qa -N1 -s < lorem.txt
15470835
4,558,095,152 bytes allocated in the heap
1,746,720 bytes copied during GC
77,936 bytes maximum residency (118 sample(s))
131,856 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8519 colls, 0 par 0.016s 0.021s 0.0000s 0.0001s
Gen 1 118 colls, 0 par 0.000s 0.004s 0.0000s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.842s ( 0.855s elapsed)
GC time 0.016s ( 0.025s elapsed)
EXIT time 0.016s ( 0.000s elapsed)
Total time 0.874s ( 0.881s elapsed)
Alloc rate 5,410,809,512 bytes per MUT second
Productivity 98.2% of total user, 97.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N4
D:\>a +RTS -qa -N4 -s < lorem.txt
15470835
4,558,093,352 bytes allocated in the heap
1,720,232 bytes copied during GC
77,936 bytes maximum residency (113 sample(s))
160,432 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
Gen 1 113 colls, 112 par 0.031s 0.027s 0.0002s 0.0099s
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.950s ( 1.415s elapsed)
GC time 4.774s ( 1.705s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 6.724s ( 3.121s elapsed)
Alloc rate 2,337,468,786 bytes per MUT second
Productivity 29.0% of total user, 62.5% of total elapsed
gc_alloc_block_sync: 21082
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
最重要的部分是
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
和
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
当-threaded
开关打开时,在运行时ghc会尽可能地平衡线程之间的任何工作。您的整个程序是一个顺序过程,因此唯一可以转移到其他线程的工作是 GC,而您的程序实际上不能并行进行 GC,因此这些线程等待彼此完成他们的工作,导致大量时间浪费在同步上.
如果您通过 +RTS -qm
告诉运行时不要在线程之间进行平衡,那么有时 -N4 与 -N1 一样快。