Haskell: reading a vector of pairs from a ByteString is very slow, how to make it faster?
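I am trying to read a large vector of a custom data type from a binary file. I tried to use the example given here.

The problem with the example code is that it uses lists, and I want to use vectors. So I adapted that code as shown below, but it takes a very long time to read a 1 MB file (more than a minute, after which I gave up).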

    module Main where

    import Data.Word
    import qualified Data.ByteString.Lazy as BIN
    import Data.Binary.Get
    import qualified Data.Vector.Unboxed as Vec

    main = do
      b <- BIN.readFile "dat.bin"             -- about 1 MB size file
      let v = runGet getPairs (BIN.tail b)    -- skip the first byte
      putStrLn $ show $ Vec.length v

    getPair :: Get (Word8, Word8)
    getPair = do
      price <- getWord8
      qty <- getWord8
      return (price, qty)

    getPairs :: Get (Vec.Vector (Word8, Word8))
    getPairs = do
      empty <- isEmpty
      if empty
        then return Vec.empty
        else do pair <- getPair
                pairs <- getPairs
                return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

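When I tried to run it (compiled with ghc --make -O2 pairs.hs), I got the error Stack space overflow: current size ...

How can I efficiently read pairs of values from a bytestring into a vector?

Again, I would like complete working code, not just pointers to the Haskell site or RWH, and not just a list of function/module names.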

Here are a couple of examples of creating vectors from a file. They are not the most efficient, but both run in just a few seconds in ghci.

    module Main where

    import qualified Data.ByteString.Lazy as BIN
    import qualified Data.ByteString as BS
    import qualified Data.Vector.Unboxed as Vec
    import Data.Word (Word8)
    import System.IO
    import System.Posix

    getFileSize :: String -> IO Int
    getFileSize path = do
      stat <- getFileStatus path
      return (fromEnum $ fileSize stat)

    -- Read two bytes from the handle for each element of the vector.
    readVector1 :: FilePath -> IO (Vec.Vector (Word8, Word8))
    readVector1 path = do
      size <- getFileSize path
      withBinaryFile path ReadMode $ \h -> do
        -- can also use: size <- hFileSize h
        let go _ = do bs <- BS.hGet h 2
                      return (BS.index bs 0, BS.index bs 1)
        Vec.generateM (div size 2) go

    -- Group a list into consecutive pairs; a trailing odd element is dropped.
    pairs :: [a] -> [(a, a)]
    pairs (a:b:rest) = (a,b) : pairs rest
    pairs _ = []

    -- Unpack the lazy ByteString to a list of pairs and unfold it into a vector.
    readVector2 :: FilePath -> IO (Vec.Vector (Word8, Word8))
    readVector2 path = do
      contents <- BIN.readFile path
      -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
      let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
            where go [] = Nothing
                  go (p:ps) = Just (p, ps)
      return v

    main = do
      v <- readVector1 "rand" -- large file
      print $ Vec.length v
      v <- readVector2 "rand"
      print $ Vec.length v

A third alternative:

    readVector3 path = do
      contents <- BS.readFile path
      let size = BS.length contents
          v = Vec.generate (div (fromIntegral size) 2) go
            where go i = let a = BS.index contents (2*i)
                             b = BS.index contents (2*i+1)
                         in (a,b)
      return v

This one turns out to be the fastest of the three.
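
Applied to the file layout in the question (one leading byte to skip, then price/qty byte pairs), the indexing approach might look like the sketch below. The names pairsFromBytes and readPairs are made up for illustration, and dropping exactly one leading byte is an assumption taken from the BIN.tail b call in the question.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import Data.Word (Word8)

-- | Hypothetical helper: drop the first byte, then read the remaining
-- bytes as (price, qty) pairs. A trailing odd byte, if any, is ignored.
pairsFromBytes :: BS.ByteString -> Vec.Vector (Word8, Word8)
pairsFromBytes raw = Vec.generate n pairAt
  where
    body = BS.drop 1 raw            -- O(1) slice; safe on empty input
    n    = BS.length body `div` 2   -- number of complete pairs
    pairAt i = (BS.index body (2 * i), BS.index body (2 * i + 1))

-- | Hypothetical helper: read the whole file strictly, then build the
-- vector in a single pass with a known final length.
readPairs :: FilePath -> IO (Vec.Vector (Word8, Word8))
readPairs path = pairsFromBytes <$> BS.readFile path
```

In ghci, Vec.length <$> readPairs "dat.bin" would then report the number of pairs. Because the final length is known up front, Vec.generate allocates the vector once and never copies it.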

Here is another way to load the vector. It uses pipes and pipes-bytestring to stream the file, and the vector function from foldl to create the vector:

    {-# LANGUAGE PackageImports #-}

    import Data.Functor (void)
    import "pipes" Pipes
    import qualified "pipes" Pipes.Prelude as P
    import qualified "pipes-bytestring" Pipes.ByteString as B
    import qualified "pipes-binary" Pipes.Binary as B
    import qualified "vector" Data.Vector.Unboxed as V
    import qualified "foldl" Control.Foldl as L
    import "lens-family-core" Lens.Family (view)
    import System.IO

    main :: IO ()
    main = do
        v <- withBinaryFile "somefile" ReadMode (\h ->
            -- for simplicity, errors are ignored with "void"
            L.impurely P.foldM L.vector (void (view B.decoded (B.drop 1 (B.fromHandle h)))))
        print (V.length (v::V.Vector (B.Word8,B.Word8)))

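cons is inefficient: each call copies the whole vector, so building a vector one element at a time this way takes quadratic time. The approach taken by vector from foldl is to progressively double the capacity of the vector using unsafeGrow in order to accommodate incoming values, and at the end "trim" any excess capacity with unsafeTake.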
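
The doubling strategy can be sketched with the mutable API of the vector package. This is only an illustration of the idea, not the actual source of foldl; fromListDoubling is a hypothetical name, and a list is used as input purely to keep the example short.

```haskell
import qualified Data.Vector.Unboxed as Vec
import qualified Data.Vector.Unboxed.Mutable as MVec

-- | Hypothetical helper: build an unboxed vector from a list without
-- knowing the length in advance, doubling the capacity whenever the
-- buffer is full.
fromListDoubling :: Vec.Unbox a => [a] -> Vec.Vector a
fromListDoubling xs = Vec.create (MVec.new 1 >>= go 0 xs)
  where
    -- n is the number of elements written so far.
    go n [] mv = pure (MVec.unsafeTake n mv)   -- trim the unused tail
    go n (y:ys) mv = do
      mv' <- if n < MVec.length mv
               then pure mv
               -- Buffer is full: grow by its current length (doubling).
               else MVec.unsafeGrow mv (MVec.length mv)
      MVec.unsafeWrite mv' n y
      go (n + 1) ys mv'
```

Each element is copied only a constant number of times on average, so the total work is linear in the input, whereas consing onto an immutable vector copies everything built so far on every step. Note that the trimmed result here is a slice of the larger buffer; Vec.force can be applied if the spare capacity should be released.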