Haskell 中的笛卡尔列表积(内存和速度)
Cartesian List Product in Haskell (Memory and Speed)
我正在尝试为笛卡尔积编写通用函数 cart :: [[a]] -> [[a]]
以生成从 0 到 7 的 9 元组数字集(9 个元素的列表而不是实际的 9 元组) .我写过几个语法相似的函数,但它们的性能却大相径庭。
cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]
cart' :: [[a]] -> [[a]] -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]
xs = [0..7]
length $ cart $ replicate 9 xs -- This is slow (4.1s) and memory hungry (1916MB total); ~75% garbage collection time
length $ sequence $ replicate 9 xs -- Treating list as a monad; this is even slower (12s) and worse on memory (4214MB); ~75% garbage collection time
length $ cart' $ replicate 9 xs -- This is slightly faster (3.4s), and uses very little memory (2MB); ~1.5% garbage collection time
length [[a,b,c,d,e,f,g,h,i] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs, f <- xs, g <- xs, h <- xs, i <- xs] -- This is ultra-fast (0.5s) and uses virtually no memory (1MB); 0% garbage collection time
有没有一种方法可以编写一个速度和内存效率与第四个实现大致相同的通用笛卡尔积函数?更重要的是,为什么这些实现差异如此之大?
Similar thread
定义如下:
xs :: [Int]
xs = [0..7]
prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)
main :: IO ()
main = print $ length $
prodCart $ replicate 9 xs
我的成绩是
134217728
---
13,345,129,576 bytes allocated in the heap
13,825,832 bytes copied during GC
44,312 bytes maximum residency (4 sample(s))
21,224 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
./cart +RTS -s 1.69s user 0.06s system 98% cpu 1.773 total
关键的见解是,因为你只需要长度,如果cart
是有生产力的,那么最大驻留应该很小,即它可以一个一个地生成结果列表的元素。你自己写的open definition(最后一个)是这样的(同样44312 residency,3.913 total time)。
在 cart
定义中,prodCart xss
子列表很大,并且它会根据您的定义保存在内存中。在这种情况下最好防止共享。
如果我们将 reverse
放入该位置,这将阻止 length
的有效计算:那么数字将会改变:顺序将是 prodCart
、cart
, 手展开。试试你自己!
main :: IO ()
main = print $ length $ reverse $
prodCart $ replicate 8 xs
有
16777216
---
2,070,840,336 bytes allocated in the heap
3,175,246,728 bytes copied during GC
716,199,432 bytes maximum residency (12 sample(s))
13,690,376 bytes maximum slop
1411 MB total memory in use (0 MB lost due to fragmentation)
./cart +RTS -s 2.88s user 0.53s system 98% cpu 3.472 total
在我的机器上。
编辑 我整理了一个标准脚本(在进行微基准测试时你应该经常这样做)
{-# LANGUAGE ScopedTypeVariables #-}
-- stack --resolver=lts-6.0 install criterion
-- stack --resolver=lts-6.0 ghc -- -Wall -O2 cart.hs
import Criterion.Main
-- import Data.Traversable (sequenceA)
cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]
cart' :: [[a]] -> [[a]] -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]
prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)
prodCartRepl :: forall a. [a] -> Int -> [[a]]
prodCartRepl xs = go
where
go :: Int -> [[a]]
go 0 = [[]]
go n = concatMap (\xs' -> map (:xs') xs) (go (n - 1))
handrolled :: [a] -> [[a]]
handrolled xs = [[a,b,c,d,e] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs ]
digits :: [Int]
digits = [0..7]
main :: IO ()
main = defaultMain
[ bgroup "length"
[ bench "cart" $ whnf (length . cart) (replicate 5 digits)
, bench "cart'" $ whnf (length . cart') (replicate 5 digits)
, bench "sequence" $ whnf (length . sequence) (replicate 5 digits)
, bench "sequenceA" $ whnf (length . sequenceA) (replicate 5 digits)
, bench "prodCart" $ whnf (length . prodCart) (replicate 5 digits)
-- Obviously different!
, bench "prodCartRepl" $ whnf (length . flip prodCartRepl 5) digits
, bench "handrolled" $ whnf (length . handrolled) digits
]
, bgroup "all"
[ bench "cart" $ nf cart (replicate 5 digits)
, bench "cart'" $ nf cart' (replicate 5 digits)
, bench "sequence" $ nf sequence (replicate 5 digits)
, bench "sequenceA" $ nf sequenceA (replicate 5 digits)
, bench "prodCart" $ nf prodCart (replicate 5 digits)
-- Obviously different!
, bench "prodCartRepl" $ nf (flip prodCartRepl 5) digits
, bench "handrolled" $ nf handrolled digits
]
]
有趣的是,当我将 "all" 部分注释掉时,handrolled 的结果将比“prodCart:
好 4 倍
benchmarking length/handrolled
time 141.8 μs (140.5 μs .. 143.0 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 141.1 μs (140.1 μs .. 142.0 μs)
std dev 3.203 μs (2.657 μs .. 4.091 μs)
variance introduced by outliers: 17% (moderately inflated)
然而,如果我添加 module Main (main, handrolled, prodCart) where
header,那么它们会变得更糟(2.8s);如果我 {-# INLINE handrolled #-}
再次下降到 140μs
会更好。其他定义不能被内联,因为它们是递归的,并且在不知道递归深度的情况下,定义不能被内联足够多次以 "unroll" 循环。似乎当 handrolled
内联并与 length
组合时,它只是计算项目(长度与列表生产融合),所以速度非常快。其他版本不会发生这种情况。
检查核心 (-ddump-simpl
) 应该可以揭示这一点,但我没有检查。
全部结果为:
benchmarking length/cart
time 885.1 μs (844.2 μs .. 934.4 μs)
0.978 R² (0.962 R² .. 0.991 R²)
mean 850.2 μs (828.3 μs .. 881.0 μs)
std dev 89.79 μs (65.94 μs .. 128.2 μs)
variance introduced by outliers: 76% (severely inflated)
benchmarking length/cart'
time 429.1 μs (417.7 μs .. 441.7 μs)
0.995 R² (0.992 R² .. 0.998 R²)
mean 437.2 μs (430.3 μs .. 447.3 μs)
std dev 26.67 μs (21.33 μs .. 33.43 μs)
variance introduced by outliers: 55% (severely inflated)
benchmarking length/sequence
time 1.006 ms (970.5 μs .. 1.057 ms)
0.971 R² (0.948 R² .. 0.988 R²)
mean 1.115 ms (1.075 ms .. 1.186 ms)
std dev 166.2 μs (120.7 μs .. 228.5 μs)
variance introduced by outliers: 86% (severely inflated)
benchmarking length/sequenceA
time 1.008 ms (977.5 μs .. 1.041 ms)
0.990 R² (0.982 R² .. 0.995 R²)
mean 1.050 ms (1.027 ms .. 1.080 ms)
std dev 90.05 μs (70.35 μs .. 114.8 μs)
variance introduced by outliers: 66% (severely inflated)
benchmarking length/prodCart
time 435.7 μs (426.7 μs .. 445.2 μs)
0.996 R² (0.993 R² .. 0.998 R²)
mean 435.6 μs (429.1 μs .. 443.3 μs)
std dev 23.63 μs (19.21 μs .. 29.16 μs)
variance introduced by outliers: 49% (moderately inflated)
benchmarking length/prodCartRepl
time 454.7 μs (424.3 μs .. 502.9 μs)
0.968 R² (0.947 R² .. 0.994 R²)
mean 448.6 μs (435.2 μs .. 466.2 μs)
std dev 51.97 μs (37.25 μs .. 71.59 μs)
variance introduced by outliers: 82% (severely inflated)
benchmarking length/handrolled
time 142.8 μs (141.0 μs .. 145.8 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 143.8 μs (142.6 μs .. 145.8 μs)
std dev 5.080 μs (3.776 μs .. 7.583 μs)
variance introduced by outliers: 33% (moderately inflated)
benchmarking all/cart
time 2.123 ms (2.050 ms .. 2.212 ms)
0.977 R² (0.955 R² .. 0.993 R²)
mean 2.035 ms (1.981 ms .. 2.129 ms)
std dev 227.1 μs (156.5 μs .. 335.3 μs)
variance introduced by outliers: 72% (severely inflated)
benchmarking all/cart'
time 1.278 ms (1.245 ms .. 1.318 ms)
0.986 R² (0.971 R² .. 0.996 R²)
mean 1.339 ms (1.301 ms .. 1.393 ms)
std dev 157.7 μs (105.3 μs .. 218.0 μs)
variance introduced by outliers: 77% (severely inflated)
benchmarking all/sequence
time 1.772 ms (1.726 ms .. 1.833 ms)
0.989 R² (0.976 R² .. 0.998 R²)
mean 1.799 ms (1.765 ms .. 1.854 ms)
std dev 148.9 μs (90.60 μs .. 234.2 μs)
variance introduced by outliers: 61% (severely inflated)
benchmarking all/sequenceA
time 2.058 ms (1.979 ms .. 2.143 ms)
0.988 R² (0.982 R² .. 0.993 R²)
mean 1.903 ms (1.859 ms .. 1.952 ms)
std dev 157.6 μs (131.6 μs .. 189.2 μs)
variance introduced by outliers: 61% (severely inflated)
benchmarking all/prodCart
time 1.367 ms (1.303 ms .. 1.438 ms)
0.988 R² (0.980 R² .. 0.996 R²)
mean 1.349 ms (1.324 ms .. 1.396 ms)
std dev 118.0 μs (74.92 μs .. 198.2 μs)
variance introduced by outliers: 65% (severely inflated)
benchmarking all/prodCartRepl
time 1.331 ms (1.294 ms .. 1.381 ms)
0.992 R² (0.988 R² .. 0.997 R²)
mean 1.350 ms (1.328 ms .. 1.379 ms)
std dev 84.37 μs (63.50 μs .. 116.7 μs)
variance introduced by outliers: 49% (moderately inflated)
benchmarking all/handrolled
time 3.552 ms (3.455 ms .. 3.711 ms)
0.986 R² (0.972 R² .. 0.996 R²)
mean 3.631 ms (3.547 ms .. 3.724 ms)
std dev 281.9 μs (226.9 μs .. 349.7 μs)
variance introduced by outliers: 51% (severely inflated)
正如您从 所有 基准测试中看到的那样,当您真正想要结果时,手动理解实际上并没有那么快,而不仅仅是产生结果。
我正在尝试为笛卡尔积编写通用函数 cart :: [[a]] -> [[a]]
以生成从 0 到 7 的 9 元组数字集(9 个元素的列表而不是实际的 9 元组) .我写过几个语法相似的函数,但它们的性能却大相径庭。
cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]
cart' :: [[a]] -> [[a]] -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]
xs = [0..7]
length $ cart $ replicate 9 xs -- This is slow (4.1s) and memory hungry (1916MB total); ~75% garbage collection time
length $ sequence $ replicate 9 xs -- Treating list as a monad; this is even slower (12s) and worse on memory (4214MB); ~75% garbage collection time
length $ cart' $ replicate 9 xs -- This is slightly faster (3.4s), and uses very little memory (2MB); ~1.5% garbage collection time
length [[a,b,c,d,e,f,g,h,i] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs, f <- xs, g <- xs, h <- xs, i <- xs] -- This is ultra-fast (0.5s) and uses virtually no memory (1MB); 0% garbage collection time
有没有一种方法可以编写一个速度和内存效率与第四个实现大致相同的通用笛卡尔积函数?更重要的是,为什么这些实现差异如此之大?
Similar thread
定义如下:
xs :: [Int]
xs = [0..7]
prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)
main :: IO ()
main = print $ length $
prodCart $ replicate 9 xs
我的成绩是
134217728
---
13,345,129,576 bytes allocated in the heap
13,825,832 bytes copied during GC
44,312 bytes maximum residency (4 sample(s))
21,224 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
./cart +RTS -s 1.69s user 0.06s system 98% cpu 1.773 total
关键的见解是,因为你只需要长度,如果cart
是有生产力的,那么最大驻留应该很小,即它可以一个一个地生成结果列表的元素。你自己写的open definition(最后一个)是这样的(同样44312 residency,3.913 total time)。
在 cart
定义中,prodCart xss
子列表很大,并且它会根据您的定义保存在内存中。在这种情况下最好防止共享。
如果我们将 reverse
放入该位置,这将阻止 length
的有效计算:那么数字将会改变:顺序将是 prodCart
、cart
, 手展开。试试你自己!
main :: IO ()
main = print $ length $ reverse $
prodCart $ replicate 8 xs
有
16777216
---
2,070,840,336 bytes allocated in the heap
3,175,246,728 bytes copied during GC
716,199,432 bytes maximum residency (12 sample(s))
13,690,376 bytes maximum slop
1411 MB total memory in use (0 MB lost due to fragmentation)
./cart +RTS -s 2.88s user 0.53s system 98% cpu 3.472 total
在我的机器上。
编辑 我整理了一个标准脚本(在进行微基准测试时你应该经常这样做)
{-# LANGUAGE ScopedTypeVariables #-}
-- stack --resolver=lts-6.0 install criterion
-- stack --resolver=lts-6.0 ghc -- -Wall -O2 cart.hs
import Criterion.Main
-- import Data.Traversable (sequenceA)
cart :: [[a]] -> [[a]]
cart [] = [[]]
cart (xs:xss) = [x:ys | x <- xs, ys <- cart xss]
cart' :: [[a]] -> [[a]] -- Reverse binding
cart' [] = [[]]
cart' (xs:xss) = [x:ys | ys <- cart' xss, x <- xs]
prodCart :: [[a]] -> [[a]]
prodCart [] = [[]]
prodCart (xs:xss) = concatMap (\xs' -> map (:xs') xs) (prodCart xss)
prodCartRepl :: forall a. [a] -> Int -> [[a]]
prodCartRepl xs = go
where
go :: Int -> [[a]]
go 0 = [[]]
go n = concatMap (\xs' -> map (:xs') xs) (go (n - 1))
handrolled :: [a] -> [[a]]
handrolled xs = [[a,b,c,d,e] | a <- xs, b <- xs, c <- xs, d <- xs, e <- xs ]
digits :: [Int]
digits = [0..7]
main :: IO ()
main = defaultMain
[ bgroup "length"
[ bench "cart" $ whnf (length . cart) (replicate 5 digits)
, bench "cart'" $ whnf (length . cart') (replicate 5 digits)
, bench "sequence" $ whnf (length . sequence) (replicate 5 digits)
, bench "sequenceA" $ whnf (length . sequenceA) (replicate 5 digits)
, bench "prodCart" $ whnf (length . prodCart) (replicate 5 digits)
-- Obviously different!
, bench "prodCartRepl" $ whnf (length . flip prodCartRepl 5) digits
, bench "handrolled" $ whnf (length . handrolled) digits
]
, bgroup "all"
[ bench "cart" $ nf cart (replicate 5 digits)
, bench "cart'" $ nf cart' (replicate 5 digits)
, bench "sequence" $ nf sequence (replicate 5 digits)
, bench "sequenceA" $ nf sequenceA (replicate 5 digits)
, bench "prodCart" $ nf prodCart (replicate 5 digits)
-- Obviously different!
, bench "prodCartRepl" $ nf (flip prodCartRepl 5) digits
, bench "handrolled" $ nf handrolled digits
]
]
有趣的是,当我将 "all" 部分注释掉时,handrolled 的结果将比“prodCart:
好 4 倍benchmarking length/handrolled
time 141.8 μs (140.5 μs .. 143.0 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 141.1 μs (140.1 μs .. 142.0 μs)
std dev 3.203 μs (2.657 μs .. 4.091 μs)
variance introduced by outliers: 17% (moderately inflated)
然而,如果我添加 module Main (main, handrolled, prodCart) where
header,那么它们会变得更糟(2.8s);如果我 {-# INLINE handrolled #-}
再次下降到 140μs
会更好。其他定义不能被内联,因为它们是递归的,并且在不知道递归深度的情况下,定义不能被内联足够多次以 "unroll" 循环。似乎当 handrolled
内联并与 length
组合时,它只是计算项目(长度与列表生产融合),所以速度非常快。其他版本不会发生这种情况。
检查核心 (-ddump-simpl
) 应该可以揭示这一点,但我没有检查。
全部结果为:
benchmarking length/cart
time 885.1 μs (844.2 μs .. 934.4 μs)
0.978 R² (0.962 R² .. 0.991 R²)
mean 850.2 μs (828.3 μs .. 881.0 μs)
std dev 89.79 μs (65.94 μs .. 128.2 μs)
variance introduced by outliers: 76% (severely inflated)
benchmarking length/cart'
time 429.1 μs (417.7 μs .. 441.7 μs)
0.995 R² (0.992 R² .. 0.998 R²)
mean 437.2 μs (430.3 μs .. 447.3 μs)
std dev 26.67 μs (21.33 μs .. 33.43 μs)
variance introduced by outliers: 55% (severely inflated)
benchmarking length/sequence
time 1.006 ms (970.5 μs .. 1.057 ms)
0.971 R² (0.948 R² .. 0.988 R²)
mean 1.115 ms (1.075 ms .. 1.186 ms)
std dev 166.2 μs (120.7 μs .. 228.5 μs)
variance introduced by outliers: 86% (severely inflated)
benchmarking length/sequenceA
time 1.008 ms (977.5 μs .. 1.041 ms)
0.990 R² (0.982 R² .. 0.995 R²)
mean 1.050 ms (1.027 ms .. 1.080 ms)
std dev 90.05 μs (70.35 μs .. 114.8 μs)
variance introduced by outliers: 66% (severely inflated)
benchmarking length/prodCart
time 435.7 μs (426.7 μs .. 445.2 μs)
0.996 R² (0.993 R² .. 0.998 R²)
mean 435.6 μs (429.1 μs .. 443.3 μs)
std dev 23.63 μs (19.21 μs .. 29.16 μs)
variance introduced by outliers: 49% (moderately inflated)
benchmarking length/prodCartRepl
time 454.7 μs (424.3 μs .. 502.9 μs)
0.968 R² (0.947 R² .. 0.994 R²)
mean 448.6 μs (435.2 μs .. 466.2 μs)
std dev 51.97 μs (37.25 μs .. 71.59 μs)
variance introduced by outliers: 82% (severely inflated)
benchmarking length/handrolled
time 142.8 μs (141.0 μs .. 145.8 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 143.8 μs (142.6 μs .. 145.8 μs)
std dev 5.080 μs (3.776 μs .. 7.583 μs)
variance introduced by outliers: 33% (moderately inflated)
benchmarking all/cart
time 2.123 ms (2.050 ms .. 2.212 ms)
0.977 R² (0.955 R² .. 0.993 R²)
mean 2.035 ms (1.981 ms .. 2.129 ms)
std dev 227.1 μs (156.5 μs .. 335.3 μs)
variance introduced by outliers: 72% (severely inflated)
benchmarking all/cart'
time 1.278 ms (1.245 ms .. 1.318 ms)
0.986 R² (0.971 R² .. 0.996 R²)
mean 1.339 ms (1.301 ms .. 1.393 ms)
std dev 157.7 μs (105.3 μs .. 218.0 μs)
variance introduced by outliers: 77% (severely inflated)
benchmarking all/sequence
time 1.772 ms (1.726 ms .. 1.833 ms)
0.989 R² (0.976 R² .. 0.998 R²)
mean 1.799 ms (1.765 ms .. 1.854 ms)
std dev 148.9 μs (90.60 μs .. 234.2 μs)
variance introduced by outliers: 61% (severely inflated)
benchmarking all/sequenceA
time 2.058 ms (1.979 ms .. 2.143 ms)
0.988 R² (0.982 R² .. 0.993 R²)
mean 1.903 ms (1.859 ms .. 1.952 ms)
std dev 157.6 μs (131.6 μs .. 189.2 μs)
variance introduced by outliers: 61% (severely inflated)
benchmarking all/prodCart
time 1.367 ms (1.303 ms .. 1.438 ms)
0.988 R² (0.980 R² .. 0.996 R²)
mean 1.349 ms (1.324 ms .. 1.396 ms)
std dev 118.0 μs (74.92 μs .. 198.2 μs)
variance introduced by outliers: 65% (severely inflated)
benchmarking all/prodCartRepl
time 1.331 ms (1.294 ms .. 1.381 ms)
0.992 R² (0.988 R² .. 0.997 R²)
mean 1.350 ms (1.328 ms .. 1.379 ms)
std dev 84.37 μs (63.50 μs .. 116.7 μs)
variance introduced by outliers: 49% (moderately inflated)
benchmarking all/handrolled
time 3.552 ms (3.455 ms .. 3.711 ms)
0.986 R² (0.972 R² .. 0.996 R²)
mean 3.631 ms (3.547 ms .. 3.724 ms)
std dev 281.9 μs (226.9 μs .. 349.7 μs)
variance introduced by outliers: 51% (severely inflated)
正如您从 所有 基准测试中看到的那样,当您真正想要结果时,手动理解实际上并没有那么快,而不仅仅是产生结果。