Haskell: Read a binary file backwards
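I'm using Haskell to find the last 32-bit word matching a particular pattern in a binary dump of uint32 values. I can do the job with last, but the code then has to traverse the entire file, so it is rather inefficient.
Is there a simple way to make readFile work through the file in reverse? I believe that would solve the problem with the smallest change to my current code.
Here is my current code, for reference. I only started with Haskell this weekend, so I'm sure it's quite ugly. It looks for the last 32-bit word that starts with 0b10 at the MSB.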
import System.Environment(getArgs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Internal as BL
import qualified Data.ByteString as BS
import Data.Binary.Get
import Data.Word
import Data.Bits
import Text.Printf(printf)
main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file
  content <- BL.readFile file
  let packets = getPackets content
  putStrLn . show . getValue . last . filterTimes $ packets

-- Data
type Packet = Word32

-- filter where first 2 bits are 10
filterTimes :: [Packet] -> [Packet]
filterTimes = filter ((== 0x2) . tag)

-- get the first 2 bits
tag :: Packet -> Packet
tag rp =
  let tagSize = 2
  in shiftR rp (finiteBitSize rp - tagSize)

-- remove the tag bits
getValue :: Packet -> Packet
getValue =
  let tagSize = 2
      mask = complement $ rotateR (2^tagSize - 1) tagSize
  in (.&.) mask

-- Input
-- Based on https://hackage.haskell.org/package/binary/docs/Data-Binary-Get.html
getPacket :: Get Packet
getPacket = do
  packet <- getWord32le
  return $! packet

getPackets :: BL.ByteString -> [Packet]
getPackets input0 = go decoder input0
  where
    decoder = runGetIncremental getPacket
    go :: Decoder Packet -> BL.ByteString -> [Packet]
    go (Done leftover _consumed packet) input =
      packet : go decoder (BL.chunk leftover input)
    go (Partial k) input =
      go (k . takeHeadChunk $ input) (dropHeadChunk input)
    go (Fail _leftover _consumed msg) _input =
      []

takeHeadChunk :: BL.ByteString -> Maybe BS.ByteString
takeHeadChunk lbs =
  case lbs of
    (BL.Chunk bs _) -> Just bs
    _ -> Nothing

dropHeadChunk :: BL.ByteString -> BL.ByteString
dropHeadChunk lbs =
  case lbs of
    (BL.Chunk _ lbs') -> lbs'
    _ -> BL.Empty
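To make the bit manipulation concrete, here is a small self-contained sketch that applies the same tag and getValue logic to a hand-picked word. The sample value 0x80000005 and the helper name isTime are assumptions made for illustration; they are not part of the program above.

```haskell
import Data.Bits (complement, finiteBitSize, rotateR, shiftR, (.&.))
import Data.Word (Word32)

type Packet = Word32

-- Top two bits of the word, shifted down to the low end (a value in 0..3).
tag :: Packet -> Packet
tag rp = shiftR rp (finiteBitSize rp - 2)

-- Clear the two tag bits, keeping the low 30 bits.
getValue :: Packet -> Packet
getValue = (.&.) mask
  where mask = complement (rotateR 3 2)   -- 0x3FFFFFFF

-- Hypothetical helper: True when the word starts with binary 10.
isTime :: Packet -> Bool
isTime = (== 0x2) . tag

-- Sample word (an assumption): tag bits 10 followed by the value 5.
sample :: Packet
sample = 0x80000005
```

With these definitions, tag sample is 2 and getValue sample is 5. Words with tag 10 have a leading hex digit between 8 and B, so a word such as 0xC0000000 (tag 11) would not pass the filter, even though the banner message mentions 0xCXXXXXXX.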
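Some comments on your code:
You are using last, which can throw an exception. You should use lastMay from the safe package, which returns a Maybe.
Since you are just treating the file as a vector of Word32s, I don't think it's worth using Data.Binary.Get with the overhead and complexity it brings. Just treat the file as a (possibly lazy) ByteString and either look at every 4th byte or break it up into 4-byte substrings.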
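The idea of indexing directly into the bytes can be sketched as follows. The helper names word32leAt and hasTag10 are hypothetical. One assumption worth checking against the actual dump: the question decodes with getWord32le, so the byte that carries the two tag bits would be the fourth byte of each 4-byte group, not the first.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)

-- Hypothetical helper: assemble the little-endian Word32 starting at byte i.
-- Returns Nothing when fewer than 4 bytes are available at that offset.
word32leAt :: BS.ByteString -> Int -> Maybe Word32
word32leAt bs i
  | i < 0 || i + 4 > BS.length bs = Nothing
  | otherwise = Just (b 0 .|. shiftL (b 1) 8 .|. shiftL (b 2) 16 .|. shiftL (b 3) 24)
  where
    b k = fromIntegral (BS.index bs (i + k)) :: Word32

-- Hypothetical predicate: the top two bits of the word are binary 10.
hasTag10 :: Word32 -> Bool
hasTag10 w = shiftR w 30 == 2
```

For example, the bytes 05 00 00 80 decode to 0x80000005, which hasTag10 accepts.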
You can have a look at code that uses ByteStrings here. It implements the following approaches to the problem:
Read the entire file in as a lazy ByteString and produce a (lazy) list of 4-byte substrings. Return the last substring that satisfies the condition.
intoWords :: BL.ByteString -> [ BL.ByteString ]
intoWords bs
  | BL.null a = []
  | otherwise = a : intoWords b
  where (a,b) = BL.splitAt 4 bs

-- find by breaking the file into 4-byte words
find_C0_v1 :: FilePath -> IO (Maybe BL.ByteString)
find_C0_v1 path = do
  contents <- BL.readFile path
  return $ lastMay . filter (\bs -> BL.index bs 0 == 0xC0) . intoWords $ contents
Read the entire file in as a lazy ByteString and look at every 4th byte for 0xC0. Return the last occurrence.
-- find by looking at every 4th byte
find_C0_v2 :: FilePath -> IO (Maybe BL.ByteString)
find_C0_v2 path = do
  contents <- BL.readFile path
  size <- fmap fromIntegral $ withFile path ReadMode hFileSize
  let wordAt i = BL.take 4 . BL.drop i $ contents
  return $ fmap wordAt $ lastMay $ filter (\i -> BL.index contents i == 0xC0) [0,4..size-1]
Read the file backwards in 64K blocks. Within each block (which is a strict ByteString), look at every 4th byte for 0xC0, starting from the end of the block. Return the first occurrence.
-- read a file backwards until a predicate returns a Just value
loopBlocks :: Int -> Handle -> Integer -> (BS.ByteString -> Integer -> Maybe a) -> IO (Maybe a)
loopBlocks blksize h top pred
  | top <= 0 = return Nothing
  | otherwise = do
      -- clamp at the start of the file so we never seek to a negative offset
      let offset = max 0 (top - fromIntegral blksize)
      hSeek h AbsoluteSeek offset
      -- read only up to top, so no bytes are scanned twice
      blk <- BS.hGet h (fromIntegral (top - offset))
      case pred blk offset of
        Nothing -> loopBlocks blksize h offset pred
        x -> return x

-- find by reading backwards, looking at every 4th byte
find_C0_v3 :: FilePath -> IO (Maybe Integer)
find_C0_v3 path = do
  withFile path ReadMode $ \h -> do
    size <- hFileSize h
    let top = size - (mod size 4)
        blksize = 64*1024 :: Int
    loopBlocks blksize h top $ \blk offset ->
      -- the block at the start of the file may be shorter than blksize
      let n = BS.length blk
      in fmap ( (+offset) . fromIntegral ) $ headMay $ filter (\i -> BS.index blk i == 0xC0) [n-4,n-8..0]
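The third approach is the fastest, even when it has to read in the entire file. The first approach actually works pretty well. I wouldn't recommend the second at all; its performance degrades sharply as the file size grows.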
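For readers who want the backward block scan applied to the question's actual condition (top two bits equal to binary 10, little-endian words), here is a minimal sketch that needs only bytestring and base. The names wordAt, lastTagged and findLastTagged are hypothetical, and the sketch assumes blksize is a positive multiple of 4.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)
import System.IO

-- Little-endian Word32 at byte offset i (caller guarantees i + 4 <= length).
wordAt :: BS.ByteString -> Int -> Word32
wordAt bs i = b 0 .|. shiftL (b 1) 8 .|. shiftL (b 2) 16 .|. shiftL (b 3) 24
  where b k = fromIntegral (BS.index bs (i + k)) :: Word32

-- Last word in a block whose top two bits are binary 10, scanning from the end.
lastTagged :: BS.ByteString -> Maybe Word32
lastTagged blk =
  case filter ((== 2) . (`shiftR` 30)) [wordAt blk i | i <- [n - 4, n - 8 .. 0]] of
    (w:_) -> Just w
    []    -> Nothing
  where n = BS.length blk - BS.length blk `mod` 4

-- Walk the file from the end towards the start, one block at a time.
-- Assumption: blksize is a positive multiple of 4.
findLastTagged :: Int -> FilePath -> IO (Maybe Word32)
findLastTagged blksize path =
  withBinaryFile path ReadMode $ \h -> do
    size <- hFileSize h
    go h (size - size `mod` 4)   -- ignore trailing bytes that do not form a word
  where
    go h top
      | top <= 0 = return Nothing
      | otherwise = do
          let offset = max 0 (top - fromIntegral blksize)
          hSeek h AbsoluteSeek offset
          blk <- BS.hGet h (fromIntegral (top - offset))
          case lastTagged blk of
            Nothing -> go h offset
            found   -> return found
```

A call such as findLastTagged (64*1024) "dump.bin" (the file name is a placeholder) stops as soon as a block contains a match, so a match near the end of the file costs a single block read.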
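For anyone else who may be interested, I have adapted @ErikR's answer. This solution follows his proposed approach 3, but makes use of my existing code by lazily stepping through the blocks in reverse.
This requires a few extra imports: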
import System.IO
import Safe
import Data.Maybe
main becomes:
main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file
  -- forward
  withFile file ReadMode $ \h -> do
    content <- BL.hGetContents h
    let packets = getPackets content
    putStrLn . show . getValue . last . filterTimes $ packets
  -- reverse
  withFile file ReadMode $ \h -> do
    size <- hFileSize h
    let blksize = 64*1024 :: Int
    chunks <- makeReverseChunks blksize h (fromIntegral size)
    putStrLn . show . getValue . (fromMaybe 0) . headMay . catMaybes . (map $ lastMay . filterTimes . getPackets) $ chunks
with this helper function added:
-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [BL.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0 = error "negative file index"
  | otherwise = do
      let offset = max (top - fromIntegral blksize) 0
      hSeek h AbsoluteSeek (fromIntegral offset)
      blk <- BL.hGet h blksize
      rest <- makeReverseChunks blksize h offset
      return $ blk : rest
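The offset arithmetic in makeReverseChunks can be checked without touching a file. The following is a sketch only: reverseRanges is a hypothetical pure helper that lists the (offset, length) pair of each block, last block first. Unlike the helper above, it also records how many bytes of each block lie below top.

```haskell
-- Hypothetical helper: (offset, length) of each block, last block first.
-- Returns an empty list for a non-positive block size or file size.
reverseRanges :: Int -> Int -> [(Int, Int)]
reverseRanges blksize top
  | blksize <= 0 || top <= 0 = []
  | otherwise = (offset, top - offset) : reverseRanges blksize offset
  where offset = max 0 (top - blksize)
```

For a 10-byte file and 4-byte blocks, reverseRanges 4 10 gives [(6,4),(2,4),(0,2)]. The offsets 6 and 2 are not multiples of 4, so 32-bit words would straddle block boundaries in that case; rounding top down to a multiple of 4 first, as find_C0_v3 does, avoids this when blksize is itself a multiple of 4.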
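Here is a variation of the function makeReverseChunks. As it stands, it is quite strict. Also, using lazy ByteStrings doesn't help if blksize is kept low enough. To get lazy reading, one has to use unsafeInterleaveIO. Here is a solution using strict ByteStrings and lazy IO:
import qualified Data.ByteString as SBS
import System.IO.Unsafe (unsafeInterleaveIO)

-- create list of data chunks, backwards in order through the file
-- (a strict chunk can be passed to getPackets via BL.fromStrict)
makeReverseChunks :: Int -> Handle -> Int -> IO [SBS.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0 = error "negative file index"
  | otherwise = do
      let offset = max (top - fromIntegral blksize) 0
      hSeek h AbsoluteSeek (fromIntegral offset)
      blk <- SBS.hGet h blksize
      -- defer reading the earlier blocks until the tail of the list is demanded
      rest <- unsafeInterleaveIO $ makeReverseChunks blksize h offset
      return $ blk : rest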
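The effect of unsafeInterleaveIO can be seen in a small experiment that counts reads. Everything here is illustrative: fakeRead stands in for a real block read, and strictBlocks, lazyBlocks and readsForHead are hypothetical names. The sketch assumes a non-negative block count.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical stand-in for reading a block: records that a "read" happened.
fakeRead :: IORef Int -> Int -> IO Int
fakeRead counter n = do
  modifyIORef' counter (+1)
  return n

-- Strict version: performs every read before returning the list.
strictBlocks :: IORef Int -> Int -> IO [Int]
strictBlocks _ 0 = return []
strictBlocks counter n = do
  blk  <- fakeRead counter n
  rest <- strictBlocks counter (n - 1)
  return (blk : rest)

-- Lazy version: the recursive call is deferred until the tail is demanded.
lazyBlocks :: IORef Int -> Int -> IO [Int]
lazyBlocks _ 0 = return []
lazyBlocks counter n = do
  blk  <- fakeRead counter n
  rest <- unsafeInterleaveIO (lazyBlocks counter (n - 1))
  return (blk : rest)

-- Number of reads performed when only the head of the list is consumed.
readsForHead :: (IORef Int -> Int -> IO [Int]) -> Int -> IO Int
readsForHead mk n = do
  counter <- newIORef 0
  xs <- mk counter n
  case xs of
    (x:_) -> x `seq` return ()
    []    -> return ()
  readIORef counter
```

Here readsForHead strictBlocks 5 performs all five reads, while readsForHead lazyBlocks 5 performs only one. With a real Handle, the deferred reads run whenever the list is consumed, so the list has to be consumed while the handle is still open, for example inside the withFile callback.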