Haskell 并行较慢
Haskell is slower in parallel
我一直在尝试并行化 Haskell 程序。令我惊讶的是,我的大部分尝试都让我的示例 运行 变慢了,所以我决定深入研究 Threadscope 并开始学习教程 here.
在part 5中引入了示例程序sudoku3,它被认为是一个很好的并行化Haskell程序的例子。所以我按照教程中的描述编译了它
ghc -O2 sudoku3.hs -threaded -rtsopts -eventlog
并且(不像教程中描述的那样)测量了它在有和没有并行性的情况下的速度。结果令人惊讶:
$ ./sudoku3 sudoku17.1000.txt +RTS -s -N1
1,181,490,684 bytes allocated in the heap
13,247,408 bytes copied during GC
1,094,432 bytes maximum residency (5 sample(s))
35,556 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2282 colls, 0 par 0.07s 0.08s 0.0000s 0.0061s
Gen 1 5 colls, 0 par 0.01s 0.01s 0.0012s 0.0044s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 1000 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1000 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.51s ( 2.90s elapsed)
GC time 0.08s ( 0.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.59s ( 2.98s elapsed)
Alloc rate 469,886,714 bytes per MUT second
Productivity 97.0% of total user, 84.3% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
这里是双核的运行:
$ ./sudoku3 sudoku17.1000.txt +RTS -s -N2
1,207,033,704 bytes allocated in the heap
23,422,808 bytes copied during GC
1,066,716 bytes maximum residency (22 sample(s))
47,524 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1488 colls, 1488 par 0.40s 0.40s 0.0003s 0.0147s
Gen 1 22 colls, 21 par 0.07s 0.06s 0.0026s 0.0087s
Parallel GC work balance: 39.57% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 1000 (968 converted, 0 overflowed, 0 dud, 0 GC'd, 32 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.45s ( 2.96s elapsed)
GC time 0.47s ( 0.45s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.93s ( 3.41s elapsed)
Alloc rate 349,389,354 bytes per MUT second
Productivity 88.0% of total user, 101.4% of total elapsed
gc_alloc_block_sync: 913
whitehole_spin: 0
gen[0].sync: 25
gen[1].sync: 0
令人惊讶的是,具有两个内核的 运行 速度较慢。这是为什么?
我的 GHC 版本是 7.6.3
操作系统:Debian Jessie Linux/GNU i386 (i686)
我刚刚又做了一次测试,正如预期的那样,并行版本更快。
事实证明,衡量性能并不是一个好主意,而 CPU 则 100% 忙于转换视频。
我一直在尝试并行化 Haskell 程序。令我惊讶的是,我的大部分尝试都让我的示例 运行 变慢了,所以我决定深入研究 Threadscope 并开始学习教程 here.
在part 5中引入了示例程序sudoku3,它被认为是一个很好的并行化Haskell程序的例子。所以我按照教程中的描述编译了它
ghc -O2 sudoku3.hs -threaded -rtsopts -eventlog
并且(不像教程中描述的那样)测量了它在有和没有并行性的情况下的速度。结果令人惊讶:
$ ./sudoku3 sudoku17.1000.txt +RTS -s -N1
1,181,490,684 bytes allocated in the heap
13,247,408 bytes copied during GC
1,094,432 bytes maximum residency (5 sample(s))
35,556 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2282 colls, 0 par 0.07s 0.08s 0.0000s 0.0061s
Gen 1 5 colls, 0 par 0.01s 0.01s 0.0012s 0.0044s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 1000 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1000 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.51s ( 2.90s elapsed)
GC time 0.08s ( 0.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.59s ( 2.98s elapsed)
Alloc rate 469,886,714 bytes per MUT second
Productivity 97.0% of total user, 84.3% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
这里是双核的运行:
$ ./sudoku3 sudoku17.1000.txt +RTS -s -N2
1,207,033,704 bytes allocated in the heap
23,422,808 bytes copied during GC
1,066,716 bytes maximum residency (22 sample(s))
47,524 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1488 colls, 1488 par 0.40s 0.40s 0.0003s 0.0147s
Gen 1 22 colls, 21 par 0.07s 0.06s 0.0026s 0.0087s
Parallel GC work balance: 39.57% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 1000 (968 converted, 0 overflowed, 0 dud, 0 GC'd, 32 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.45s ( 2.96s elapsed)
GC time 0.47s ( 0.45s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.93s ( 3.41s elapsed)
Alloc rate 349,389,354 bytes per MUT second
Productivity 88.0% of total user, 101.4% of total elapsed
gc_alloc_block_sync: 913
whitehole_spin: 0
gen[0].sync: 25
gen[1].sync: 0
令人惊讶的是,具有两个内核的 运行 速度较慢。这是为什么?
我的 GHC 版本是 7.6.3
操作系统:Debian Jessie Linux/GNU i386 (i686)
我刚刚又做了一次测试,正如预期的那样,并行版本更快。
事实证明,衡量性能并不是一个好主意,而 CPU 则 100% 忙于转换视频。