Why does this Haskell code run slower with -O?
This piece of Haskell code runs much slower with -O, but -O is supposed to be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment trees:
import Control.Monad
import Data.Array

data Node = Leaf   Int           -- value
          | Branch Int Node Node -- sum, left child, right child
type NodeArray = Array Int Node

-- create an empty node with range [l, r)
create :: Int -> Int -> Node
create l r
    | l + 1 == r = Leaf 0
    | otherwise  = Branch 0 (create l m) (create m r)
  where m = (l + r) `div` 2

-- Get the sum in range [0, r). The range of the node is [nl, nr)
sumof :: Node -> Int -> Int -> Int -> Int
sumof (Leaf val) r nl nr
    | nr <= r   = val
    | otherwise = 0
sumof (Branch sum lc rc) r nl nr
    | nr <= r   = sum
    | r > nl    = (sumof lc r nl m) + (sumof rc r m nr)
    | otherwise = 0
  where m = (nl + nr) `div` 2

-- Increase the value at x by 1. The range of the node is [nl, nr)
increase :: Node -> Int -> Int -> Int -> Node
increase (Leaf val) x nl nr = Leaf (val + 1)
increase (Branch sum lc rc) x nl nr
    | x < m     = Branch (sum + 1) (increase lc x nl m) rc
    | otherwise = Branch (sum + 1) lc (increase rc x m nr)
  where m = (nl + nr) `div` 2

-- signature said it all
tonodes :: Int -> [Int] -> [Node]
tonodes n = reverse . tonodes' . reverse
  where
    tonodes' :: [Int] -> [Node]
    tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t
    tonodes' _     = [create 0 n]

-- find the minimum m in [l, r] such that (predicate m) is True
binarysearch :: (Int -> Bool) -> Int -> Int -> Int
binarysearch predicate l r
    | l == r      = r
    | predicate m = binarysearch predicate l m
    | otherwise   = binarysearch predicate (m + 1) r
  where m = (l + r) `div` 2

-- main, literally
main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
    replicateM_ m $ query n nodes
  where
    query :: Int -> NodeArray -> IO ()
    query n nodes = do
        [p, k] <- fmap (map read . words) getLine
        print $ binarysearch (ok nodes n p k) 0 n
      where
        ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
        ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k
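As a sanity check of the segment-tree primitives, here is a self-contained miniature: the definitions above are restated so the snippet runs on its own, and the demo values (a tree over [0, 4) with two increments at index 2) are my own, not from the original post.

```haskell
-- Restated primitives from the question, so this snippet runs standalone.
data Node = Leaf Int | Branch Int Node Node

create :: Int -> Int -> Node
create l r
    | l + 1 == r = Leaf 0
    | otherwise  = Branch 0 (create l m) (create m r)
  where m = (l + r) `div` 2

sumof :: Node -> Int -> Int -> Int -> Int
sumof (Leaf val) r _ nr
    | nr <= r   = val
    | otherwise = 0
sumof (Branch s lc rc) r nl nr
    | nr <= r   = s
    | r > nl    = sumof lc r nl m + sumof rc r m nr
    | otherwise = 0
  where m = (nl + nr) `div` 2

increase :: Node -> Int -> Int -> Int -> Node
increase (Leaf val) _ _ _ = Leaf (val + 1)
increase (Branch s lc rc) x nl nr
    | x < m     = Branch (s + 1) (increase lc x nl m) rc
    | otherwise = Branch (s + 1) lc (increase rc x m nr)
  where m = (nl + nr) `div` 2

-- Bump index 2 twice in a tree over [0, 4), then read all prefix sums.
main :: IO ()
main = print [ sumof t r 0 4 | r <- [1 .. 4] ]
  where t = increase (increase (create 0 4) 2 0 4) 2 0 4
-- prints [0,0,2,2]
```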
(This is exactly the same code as on code review, but this question addresses a different problem.)
Here is my C++ input generator:
#include <cstdio>
#include <cstdlib>
using namespace std;

int main (int argc, char * argv[]) {
    srand(1827);
    int n = 100000;
    if(argc > 1)
        sscanf(argv[1], "%d", &n);
    printf("%d %d\n", n, n);
    for(int i = 0; i < n; i++)
        printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
    for(int i = 0; i < n; i++) {
        int p = rand() % n;
        int k = rand() % n + 1;
        printf("%d %d\n", p, k);
    }
}
In case you don't have a C++ compiler available, this is the result of ./gen.exe 1000.
Here is the execution result on my computer:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
$ ghc -fforce-recomp 1827.hs
[1 of 1] Compiling Main ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real 0m0.088s
user 0m0.015s
sys 0m0.015s
$ ghc -fforce-recomp -O 1827.hs
[1 of 1] Compiling Main ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real 0m2.969s
user 0m0.000s
sys 0m0.045s
Here is the heap profile summary:
$ ghc -fforce-recomp -rtsopts ./1827.hs
[1 of 1] Compiling Main ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
70,207,096 bytes allocated in the heap
2,112,416 bytes copied during GC
613,368 bytes maximum residency (3 sample(s))
28,816 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 132 colls, 0 par 0.00s 0.00s 0.0000s 0.0004s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0006s 0.0010s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.03s ( 0.03s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.03s ( 0.04s elapsed)
%GC time 0.0% (14.7% elapsed)
Alloc rate 2,250,213,011 bytes per MUT second
Productivity 100.0% of total user, 83.1% of total elapsed
$ ghc -fforce-recomp -O -rtsopts ./1827.hs
[1 of 1] Compiling Main ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
6,009,233,608 bytes allocated in the heap
622,682,200 bytes copied during GC
443,240 bytes maximum residency (505 sample(s))
48,256 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10945 colls, 0 par 0.72s 0.63s 0.0001s 0.0004s
Gen 1 505 colls, 0 par 0.16s 0.13s 0.0003s 0.0005s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.00s ( 2.13s elapsed)
GC time 0.87s ( 0.76s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.89s ( 2.90s elapsed)
%GC time 30.3% (26.4% elapsed)
Alloc rate 3,009,412,603 bytes per MUT second
Productivity 69.7% of total user, 69.4% of total elapsed
What happens to your code with -O
Let me zoom in on your main function, and rewrite it slightly:
main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
    replicateM_ m $ query n nodes
Clearly, the intention here is that the NodeArray gets created once, and is then used in each of the m invocations of query.
Unfortunately, GHC effectively transforms this code into
main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    replicateM_ m $ do
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        query n nodes
And you can immediately see the problem here.
What is the state hack, and why does it ruin my program's performance?
The culprit is the state hack, which says, roughly: "When something has type IO a, assume it is called only once." The official documentation is not much more elaborate:
-fno-state-hack
Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.
Roughly, the idea is as follows: if you define a function with an IO type and a where clause, e.g.
foo x = do
    putStrLn y
    putStrLn y
  where y = ...x...
Something of type IO a can be viewed as something of type RealWorld -> (a, RealWorld). In that view, the above becomes (roughly)
foo x =
    let y = ...x... in
    \world1 ->
        let (world2, ()) = putStrLn y world1 in
        let (world3, ()) = putStrLn y world2 in
        (world3, ())
A call to foo will (usually) look like foo argument world. But the definition of foo takes only one argument, and the second one is consumed only later, by the local lambda expression! That makes for a very slow call to foo. It would be much faster if the code looked like this:
foo x world1 =
    let y = ...x... in
    let (world2, ()) = putStrLn y world1 in
    let (world3, ()) = putStrLn y world2 in
    (world3, ())
This is called eta-expansion, and it is done on various grounds (e.g. by analyzing the function's definition, by checking how it is being called, and, in this case, by type-directed heuristics).
Unfortunately, it degrades performance if the call to foo is actually of the form let fooArgument = foo argument, i.e. with one argument but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the eta-expanded code, y is re-calculated every time; this is precisely what happened to your nodes.
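The loss of sharing caused by eta-expansion can be reproduced in miniature with pure code (a hedged sketch; the names f1, f2, and the lookup table are made up for illustration, not taken from the original post):

```haskell
-- f1 builds `table` once per n; a partial application such as `f1 5`
-- shares the table across all later calls of the returned function.
f1 :: Int -> (Int -> Int)
f1 n = let table = map (* 2) [0 .. n] in \i -> table !! i

-- f2 is the eta-expanded form: `table` now sits under both arguments,
-- so a partial application `f2 n` rebuilds the table on every call.
f2 :: Int -> Int -> Int
f2 n i = let table = map (* 2) [0 .. n] in table !! i

main :: IO ()
main = do
    let g1 = f1 5   -- table built once, shared by every g1 call
        g2 = f2 5   -- table rebuilt on every g2 call
    print (g1 3, g2 3)
-- prints (6,6)
```

Both variants compute the same values; the difference is only in how much work is shared between calls, which is exactly the trade-off the state hack makes for IO code.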
Can this be fixed?
Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to be ok, even though the compiler cannot know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but the transformation is still beneficial because the speedup from the faster call outweighs the extra cost of the recalculation. So it is not clear where to go from here.
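In the meantime, one per-module workaround (a sketch based on the flag quoted above, not something stated in the original post) is to disable the state hack for just the affected module with an OPTIONS_GHC pragma:

```haskell
-- Disables the state hack for this module only, which should restore the
-- sharing of `nodes`, at the possible cost of slower IO code elsewhere
-- in the module.
{-# OPTIONS_GHC -fno-state-hack #-}
module Main where

-- ... rest of the program unchanged ...
```

Whether this is a net win depends on how much the module's other IO code benefits from the hack, so it is worth re-running the timing comparison after adding the pragma.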