Haskell: reading a vector of pairs from a ByteString is very slow, how to make it faster?

I am trying to read a large vector of a custom data type from a binary file. I tried to use the example given here.

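The problem with the example code is that it uses lists, and I want to use vectors. So I adapted that code as shown below, but reading a 1 MB file takes a very long time (more than a minute, after which I gave up).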

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec

main = do
  b <- BIN.readFile "dat.bin" -- about 1 MB size file
  let v = runGet getPairs (BIN.tail b) -- skip the first byte
  putStrLn $ show $ Vec.length v

getPair :: Get (Word8, Word8)
getPair = do
  price <- getWord8
  qty <- getWord8
  return (price, qty)

getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
 empty <- isEmpty
 if empty
   then return Vec.empty
   else do pair  <- getPair
           pairs <- getPairs
           return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

When I compile it with ghc --make -O2 pairs.hs and run it, I get the error Stack space overflow: current size ...

How can I efficiently read pairs of values from a ByteString into a vector?

Again, I would like complete working code, not just pointers to the Haskell site or RWH, nor just a list of function/module names.

Here are a couple of examples of creating a vector from a file. They are not the most efficient, but each runs in just a few seconds in ghci.

module Main where

import qualified Data.ByteString.Lazy as BIN
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import System.IO
import System.Posix

getFileSize :: String -> IO Int
getFileSize path = do
    stat <- getFileStatus path
    return (fromEnum $ fileSize stat)

readVector1 path = do
  size <- getFileSize path
  withBinaryFile path ReadMode $ \h -> do
    -- can also use: size <- hFileSize h
    let go _ = do bs <- BS.hGet h 2 
                  return (BS.index bs 0, BS.index bs 1)
    Vec.generateM (div size 2) go

pairs (a:b:rest) = (a,b) : pairs rest
pairs _          = []

readVector2 path = do
  contents <- BIN.readFile path
  -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
  let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
        where go [] = Nothing
              go (p:ps) = Just (p, ps)
  return v

main = do
  v <- readVector1 "rand" -- large file
  print $ Vec.length v
  v <- readVector2 "rand"
  print $ Vec.length v

A third option:

readVector3 path = do
  contents <- BS.readFile path
  let size = BS.length contents
      v = Vec.generate (div (fromIntegral size) 2) go
            where go i = let a = BS.index contents (2*i)
                             b = BS.index contents (2*i+1)
                         in (a,b)
  return v

This one turned out to be the fastest of the three.
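The code in the question also skips the first byte of the file (BIN.tail b), which the three readers above do not do. A minimal sketch of how the third approach might be adapted for that is shown here. The names pairsFrom and readPairsSkippingHeader are hypothetical helpers introduced for illustration, and the sketch assumes that a trailing odd byte can simply be ignored.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import Data.Word (Word8)

-- Hypothetical helper (not part of the answer above): build the vector of
-- pairs from a strict ByteString after dropping the first `skip` bytes.
pairsFrom :: Int -> BS.ByteString -> Vec.Vector (Word8, Word8)
pairsFrom skip contents = Vec.generate n pairAt
  where
    payload  = BS.drop skip contents
    n        = BS.length payload `div` 2  -- a trailing odd byte is ignored
    pairAt i = (BS.index payload (2 * i), BS.index payload (2 * i + 1))

-- Hypothetical wrapper matching the question: skip one leading byte.
readPairsSkippingHeader :: FilePath -> IO (Vec.Vector (Word8, Word8))
readPairsSkippingHeader path = pairsFrom 1 <$> BS.readFile path
```

Keeping pairsFrom pure makes it easy to try in ghci on a small BS.pack'ed input before pointing readPairsSkippingHeader at a real file. Because the length is known up front, Vec.generate can allocate the vector once and fill it by index.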

Here is another way to load the vector, this time using pipes and pipes-bytestring to stream the file, and the vector function from foldl to create the vector:

{-# LANGUAGE PackageImports #-}
import Data.Functor (void)
import "pipes" Pipes
import qualified "pipes" Pipes.Prelude as P
import qualified "pipes-bytestring" Pipes.ByteString as B
import qualified "pipes-binary" Pipes.Binary as B
import qualified "vector" Data.Vector.Unboxed as V
import qualified "foldl" Control.Foldl as L
import "lens-family-core" Lens.Family (view)
import System.IO

main :: IO ()
main = do
    v <- withBinaryFile "somefile" ReadMode (\h ->
        -- for simplicity, errors are ignored with "void"
        L.impurely P.foldM L.vector (void (view B.decoded (B.drop 1 (B.fromHandle h)))))
    print (V.length (v::V.Vector (B.Word8,B.Word8)))

cons is inefficient. The approach taken by foldl's vector is to progressively double the capacity of the vector using unsafeGrow, in order to accommodate incoming values, and at the end "trim" any excess capacity with unsafeTake.
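To make the doubling strategy concrete, here is a rough sketch of the same idea written directly against Data.Vector.Unboxed.Mutable. It is not foldl's actual source. fromListDoubling is a hypothetical name, the input is a plain list for simplicity, and the initial capacity of 4 is an arbitrary assumption.

```haskell
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

-- Hypothetical helper: fill a mutable buffer from a list, doubling its
-- capacity whenever it is full, then trim the unused tail at the end.
fromListDoubling :: V.Unbox a => [a] -> V.Vector a
fromListDoubling xs0 = V.create (MV.unsafeNew 4 >>= go 0 xs0)
  where
    -- `used` counts the slots written so far; the buffer may be larger.
    go :: V.Unbox a => Int -> [a] -> MV.MVector s a -> ST s (MV.MVector s a)
    go used [] buf = return (MV.unsafeTake used buf)  -- drop excess capacity
    go used (x:xs) buf
      | used < MV.length buf = do
          MV.unsafeWrite buf used x
          go (used + 1) xs buf
      | otherwise = do
          -- Buffer is full: grow by its current length, i.e. double it.
          -- The initial capacity must be positive or this would never grow.
          buf' <- MV.unsafeGrow buf (MV.length buf)
          go used (x:xs) buf'
```

Each element is written once, and the buffer is copied only when it fills up, so the total copying work stays proportional to the final length. Repeated cons, by contrast, copies the whole vector for every element.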