Haskell: Read a binary file backwards

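I'm using Haskell to find the last 32-bit word in a uInt32 binary dump that matches a particular pattern. I can get the job done using last, but the code has to traverse the whole file, so it is quite inefficient.

Is there a simple way to make readFile work through the file in reverse? I believe this would solve the problem with minimal changes to the current code.

Here is my current code, for reference. I only started using Haskell this weekend, so I'm sure it is quite ugly. It looks for the last 32-bit word that starts with 0b10 at the MSB.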

import System.Environment(getArgs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Internal as BL
import qualified Data.ByteString as BS
import Data.Binary.Get
import Data.Word
import Data.Bits
import Text.Printf(printf)

main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file

  content <- BL.readFile file

  let packets = getPackets content
  putStrLn . show . getValue . last . filterTimes $ packets

-- Data

type Packet = Word32

-- filter where first 2 bits are 10
filterTimes :: [Packet] -> [Packet]
filterTimes = filter ((== 0x2) . tag)

-- get the first 2 bits
tag :: Packet -> Packet
tag rp =
  let tagSize = 2
  in  shiftR rp (finiteBitSize rp - tagSize)

-- remove the tag bits
getValue :: Packet -> Packet
getValue =
  let tagSize = 2
      mask    = complement $ rotateR (2^tagSize - 1) tagSize
  in (.&.) mask

-- Input
-- Based on https://hackage.haskell.org/package/binary/docs/Data-Binary-Get.html

getPacket :: Get Packet
getPacket = do
  packet <- getWord32le
  return $! packet

getPackets :: BL.ByteString -> [Packet]
getPackets input0 = go decoder input0
  where
    decoder = runGetIncremental getPacket
    go :: Decoder Packet -> BL.ByteString -> [Packet]
    go (Done leftover _consumed packet) input =
      packet : go decoder (BL.chunk leftover input)
    go (Partial k) input                     =
      go (k . takeHeadChunk $ input) (dropHeadChunk input)
    go (Fail _leftover _consumed msg) _input =
      []

takeHeadChunk :: BL.ByteString -> Maybe BS.ByteString
takeHeadChunk lbs =
  case lbs of
    (BL.Chunk bs _) -> Just bs
    _ -> Nothing

dropHeadChunk :: BL.ByteString -> BL.ByteString
dropHeadChunk lbs =
  case lbs of
    (BL.Chunk _ lbs') -> lbs'
    _ -> BL.Empty

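Some comments on your code:

  1. You are using last, which can throw an exception (on an empty list). You should use lastMay from the safe package, which returns a Maybe.

  2. Since you are just treating the file as a vector of Word32s, I don't think it is worth using Data.Binary.Get with the overhead and complexity it brings. Just treat the file as a (possibly lazy) ByteString and either access every 4th byte or break it up into 4-byte substrings.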
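
As a rough sketch of point 2, a 32-bit word can be assembled directly from four bytes of a strict ByteString with shifts, without Data.Binary.Get. The names wordAt and hasTag are made up for this illustration, and little-endian byte order is assumed only because the question's code uses getWord32le.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)

-- Decode the little-endian 32-bit word starting at byte offset i.
-- Returns Nothing if fewer than 4 bytes are available at that offset.
wordAt :: BS.ByteString -> Int -> Maybe Word32
wordAt bs i
  | i < 0 || i + 4 > BS.length bs = Nothing
  | otherwise =
      Just (byte 0 .|. byte 1 `shiftL` 8 .|. byte 2 `shiftL` 16 .|. byte 3 `shiftL` 24)
  where
    -- widen each byte to Word32 before shifting it into place
    byte k = fromIntegral (BS.index bs (i + k)) :: Word32

-- True when the two most significant bits of the word are 0b10,
-- the same test as the question's filterTimes (hypothetical helper name).
hasTag :: Word32 -> Bool
hasTag w = w `shiftR` 30 == 0x2
```

Under these assumptions, a tag of 0b10 in the two most significant bits matches words whose top hex digit is 8 through B, and in a little-endian file those bits sit in the fourth byte of each word rather than the first.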

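You can have a look at code that uses ByteStrings here. It implements the following approaches to the problem:

  • Read in the whole file as a lazy ByteString and produce a (lazy) list of 4-byte substrings. Return the last substring that satisfies the condition.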

    intoWords :: BL.ByteString -> [ BL.ByteString ]
    intoWords bs
      | BL.null a = []
      | otherwise = a : intoWords b
      where (a,b) = BL.splitAt 4 bs
    
    -- find by breaking the file into 4-byte words
    find_C0_v1 :: FilePath -> IO (Maybe BL.ByteString)
    find_C0_v1 path = do
      contents <- BL.readFile path
      return $ lastMay . filter (\bs -> BL.index bs 0 == 0xC0) . intoWords $ contents
    
  • Read in the whole file as a lazy ByteString and access every 4th byte, looking for 0xC0. Return the last occurrence.

    -- find by looking at every 4th byte
    find_C0_v2 :: FilePath -> IO (Maybe BL.ByteString)
    find_C0_v2 path = do
      contents <- BL.readFile path
      size <- fmap fromIntegral $ withFile path ReadMode hFileSize
      let wordAt i = BL.take 4 . BL.drop i $ contents
      return $ fmap wordAt $ lastMay $ filter (\i -> BL.index contents i == 0xC0) [0,4..size-1]
    
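  • Read the file backwards in 64K blocks. Within each block (which is a strict ByteString), access every 4th byte, starting from the end of the block, looking for 0xC0. Return the first occurrence.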

    -- read a file backwards until a predicate returns a Just value
    loopBlocks :: Int -> Handle -> Integer -> (BS.ByteString -> Integer -> Maybe a) -> IO (Maybe a)
    loopBlocks blksize h top pred
      | top <= 0 = return Nothing
      | otherwise   = do
            -- clamp at 0 so the lowest block may be shorter than blksize
            -- (a negative offset would make hSeek fail)
            let offset = max 0 (top - fromIntegral blksize)
            hSeek h AbsoluteSeek offset
            blk <- BS.hGet h (fromIntegral (top - offset))
            case pred blk offset of
              Nothing -> loopBlocks blksize h offset pred
              x       -> return x
    
    -- find by reading backwards, looking at every 4th byte
    find_C0_v3 :: FilePath -> IO (Maybe Integer)
    find_C0_v3 path = do
      withFile path ReadMode $ \h -> do
        size <- hFileSize h
        let top = size - (mod size 4)
            blksize = 64*1024 :: Int
        loopBlocks blksize h top $ \blk offset ->
              -- index from the actual block length, which is a multiple of 4 but may be less than blksize
              fmap ( (+offset) . fromIntegral ) $ headMay $ filter (\i -> BS.index blk i == 0xC0) [BS.length blk - 4, BS.length blk - 8 .. 0]
    

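The third method is the fastest, even when it has to read in the entire file. The first method actually works pretty well. I wouldn't recommend the second at all: its performance degrades precipitously as the file size grows, most likely because BL.index and BL.drop on a lazy ByteString walk the chunk list from the front on every access.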
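
The per-block step of the third approach can also be written as a pure function, which makes it easy to try out without a file. The sketch below is only an illustration: leWord and lastMatchInBlock are hypothetical names, the predicate is passed in rather than fixed to 0xC0, and little-endian words are assumed as in the question's getWord32le.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32)

-- Little-endian word at byte offset i; the caller guarantees i + 4 <= length.
leWord :: BS.ByteString -> Int -> Word32
leWord bs i = b 0 .|. b 1 `shiftL` 8 .|. b 2 `shiftL` 16 .|. b 3 `shiftL` 24
  where
    b k = fromIntegral (BS.index bs (i + k)) :: Word32

-- Scan one block from its last complete word towards its start and return
-- the byte offset and value of the first word that satisfies the predicate.
-- Trailing bytes that do not form a complete word are ignored.
lastMatchInBlock :: (Word32 -> Bool) -> BS.ByteString -> Maybe (Int, Word32)
lastMatchInBlock p blk = go (BS.length blk - BS.length blk `mod` 4 - 4)
  where
    go i
      | i < 0     = Nothing          -- ran off the front of the block
      | p w       = Just (i, w)      -- first hit from the end is the last match
      | otherwise = go (i - 4)
      where
        w = leWord blk i
```

Plugged into a loop such as loopBlocks, a function like this would replace the filter over an index list; adding the block's file offset to the returned position gives the position in the file.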

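For anyone else who may be interested, I have adapted @ErikR's answer. This solution follows his proposed solution 3, but makes use of my existing code by stepping through the blocks in reverse, lazily.

This requires a few extra imports: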

import System.IO
import Safe
import Data.Maybe

main becomes:

main = do
  args <- getArgs
  let file = args!!0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file

  -- forward
  withFile file ReadMode $ \h -> do
    content <- BL.hGetContents h
    let packets = getPackets content
    putStrLn . show . getValue . last . filterTimes $ packets

  -- reverse
  withFile file ReadMode $ \h -> do
    size <- hFileSize h
    let blksize = 64*1024 :: Int
    chunks <- makeReverseChunks blksize h (fromIntegral size)
    putStrLn . show . getValue . (fromMaybe 0) . headMay . catMaybes . (map $ lastMay . filterTimes . getPackets) $ chunks

A helper function is added:

-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [BL.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0  = error "negative file index"
  | otherwise   = do
        -- round down to a 4-byte boundary so every chunk holds whole words
        -- (assumes blksize is a multiple of 4)
        let top'   = top - top `mod` 4
            offset = max (top' - blksize) 0
        hSeek h AbsoluteSeek (fromIntegral offset)
        -- read only up to the start of the previous chunk, so chunks never overlap
        blk <- BL.hGet h (top' - offset)
        rest <- makeReverseChunks blksize h offset
        return $ blk : rest

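Here is a variation of the function makeReverseChunks. As it stands, it is quite strict. Also, using lazy ByteStrings does not help if blksize is kept low enough. To get lazy reading, unsafeInterleaveIO (from System.IO.Unsafe) has to be used. Here is a solution using strict ByteStrings and lazy IO: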

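-- create list of data chunks, backwards in order through the file
-- needs: import qualified Data.ByteString as SBS
--        import System.IO.Unsafe (unsafeInterleaveIO)
-- (callers feeding getPackets must convert each chunk with BL.fromStrict)
makeReverseChunks :: Int -> Handle -> Int -> IO [SBS.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0  = error "negative file index"
  | otherwise   = do
    -- round down to a 4-byte boundary so every chunk holds whole words
    -- (assumes blksize is a multiple of 4)
    let top'   = top - top `mod` 4
        offset = max (top' - blksize) 0
    hSeek h AbsoluteSeek (fromIntegral offset)
    -- read only up to the start of the previous chunk, so chunks never overlap
    blk <- SBS.hGet h (top' - offset)
    -- defer reading the remaining chunks until the list is actually demanded
    rest <- unsafeInterleaveIO $ makeReverseChunks blksize h offset
    return $ blk : rest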
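
To see what unsafeInterleaveIO changes, here is a small file-free sketch of the same recursion pattern. It is only an illustration under assumed names (countdown and readsNeeded are hypothetical): an IORef counter stands in for the hSeek/hGet pair and records how many "reads" actually ran when only part of the list is consumed.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Build [top-1, top-2 .. 0], bumping the counter once per element produced.
-- The recursive call is deferred, so it only runs when the tail is demanded.
countdown :: IORef Int -> Int -> IO [Int]
countdown counter top
  | top <= 0  = return []
  | otherwise = do
      modifyIORef' counter (+ 1)   -- stands in for reading one block
      rest <- unsafeInterleaveIO (countdown counter (top - 1))
      return (top - 1 : rest)

-- Consume only the first `wanted` elements out of `total` and report how
-- many simulated reads were performed to produce them.
readsNeeded :: Int -> Int -> IO Int
readsNeeded total wanted = do
  counter <- newIORef 0
  xs <- countdown counter total
  mapM_ (\x -> x `seq` return ()) (take wanted xs)
  readIORef counter
```

Without the unsafeInterleaveIO wrapper, every recursive call would run before the list is returned, so the counter would always end up at total. With real file handles, the deferred reads must be forced before withFile closes the handle.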