How to parse an entropy-coded JPEG block efficiently?

Question

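I just want to skip over the SOS_MT chunk in a .JPEG file. I don't want to use the data for anything, I just want to know where it ends. From what I understand of Wikipedia's JPEG article, every other chunk in a JPEG file starts with a few bytes giving the chunk's length, but an SOS_MT chunk is... well, an evil swamp that you have no option but to parse byte by byte until you reach the end of it.

So I came up with the following code to do just that: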

import Control.Applicative ((<|>))
import Control.Monad (foldM)
import Data.Attoparsec.ByteString

entropyCoded :: Parser Int
entropyCoded = do
    list_of_lengths <-  many' $
         (
           do
             _ <- notWord8 0xFF
             return 1
         )
         <|>
         (
           do
             _ <- word8 0xFF
             _ <- word8 0
             return 2
         )
         <|>
         (
           do
             l <- many1 (word8 0xFF)
             _ <- satisfy (\x -> x >= 0xD0 && x <= 0xD7)  -- RST0..RST7 are 0xD0..0xD7 inclusive
             return $ 1 + length l
         )
         <|>
         (
           do
             _ <- word8 0xFF
             maybe_ff <- peekWord8'
             if maybe_ff == 0xFF
               then
                 return 1
               else
                 fail "notthere"
         )
    foldM (\ nn n -> nn `seq` return (nn + n) ) 0 list_of_lengths

This code uses Attoparsec and, as far as I have been able to verify, it is correct. It is just slow. Any tips on how to improve this parser's performance?

Answer 1

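If you want to skip over the data that follows the SOS marker, just look for the next marker that is not a restart marker.

Read bytes until you find an FF. If the next value is 00, it is a stuffed (escaped) FF data byte, so skip it. If it is a restart marker (FFD0 through FFD7), skip it. Otherwise, the FF should start the next block.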
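The rule above can be sketched directly on a strict ByteString, without a parser combinator. This is only an illustration under stated assumptions: the names `entropyEnd` and `isRestart` are made up for this example, the input is assumed to start at the first entropy-coded byte, and a run of FF fill bytes is reported at its first FF rather than being examined further. Jumping between FF bytes with `B.elemIndex` avoids doing per-byte parser work, which is likely where most of the time goes in the question's version.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Offset of the 0xFF that starts the first non-restart marker after the
-- entropy-coded data, or Nothing if the input ends before one is found.
-- Assumption: the input begins at the first entropy-coded byte.
entropyEnd :: B.ByteString -> Maybe Int
entropyEnd bs = go 0
  where
    go i = case B.elemIndex 0xFF (B.drop i bs) of
      Nothing -> Nothing
      Just k
        | j + 1 >= B.length bs -> Nothing     -- lone 0xFF at the end of input
        | next == 0x00         -> go (j + 2)  -- stuffed zero: 0xFF was data
        | isRestart next       -> go (j + 2)  -- restart marker: keep scanning
        | otherwise            -> Just j      -- start of the next block
        where
          j    = i + k                 -- absolute offset of this 0xFF
          next = B.index bs (j + 1)    -- only forced after the bounds check

-- | Restart markers RST0..RST7 use the codes 0xD0..0xD7.
isRestart :: Word8 -> Bool
isRestart b = b >= 0xD0 && b <= 0xD7
```

The returned offset is relative to the start of the ByteString passed in, so add the position of the first entropy-coded byte to get a file offset.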

Answer 2

A small addition to the previous answer, based on the ISO/IEC 10918-1 : 1993(E) standard:

B.1.1.5 Entropy-coded data segments

An entropy-coded data segment contains the output of an entropy-coding procedure. It consists of an integer number of bytes, whether the entropy-coding procedure used is Huffman or arithmetic.

NOTE 1

Making entropy-coded segments an integer number of bytes is performed as follows: for Huffman coding, 1-bits are used, if necessary, to pad the end of the compressed data to complete the final byte of a segment. For arithmetic coding, byte alignment is performed in the procedure which terminates the entropy-coded segment (see D.1.8).

NOTE 2

In order to ensure that a marker does not occur within an entropy-coded segment, any X'FF' byte generated by either a Huffman or arithmetic encoder, or an X'FF' byte that was generated by the padding of 1-bits described in NOTE 1 above, is followed by a "stuffed" zero byte (see D.1.6 and F.1.2.3).

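So when you encounter 0xFF at position N in the entropy-coded section, read one more byte ahead. If the next byte is 0x00, it is a "stuffed" zero. If it is another 0xFF, it is a fill (padding) byte, so re-check from N+1. Any other byte (0x01-0xFE) is part of the next marker.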
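The look-ahead rule in the previous paragraph could be written as a small classifier. This is a minimal sketch, not anything prescribed by the standard: the type `AfterFF` and the function `classifyFF` are hypothetical names, and the caller is assumed to pass an offset that really holds a 0xFF. Restart markers (0xD0-0xD7) come back as an ordinary `Marker`, so a caller that only wants the end of the scan would keep going when it sees one.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | What the 0xFF at a given offset turned out to be.
data AfterFF
  = Stuffed Int        -- 0xFF 0x00 pair starting at this offset (a data byte)
  | Marker Int Word8   -- offset of the 0xFF directly before the marker code
  | Truncated          -- input ended before the byte could be classified
  deriving (Eq, Show)

-- | Classify the 0xFF at offset n.
-- Assumption: B.index bs n == 0xFF; this is not checked.
classifyFF :: B.ByteString -> Int -> AfterFF
classifyFF bs n
  | n + 1 >= B.length bs = Truncated
  | otherwise = case B.index bs (n + 1) of
      0x00 -> Stuffed n
      0xFF -> classifyFF bs (n + 1)  -- fill byte: re-check from N+1
      code -> Marker n code          -- 0x01..0xFE: a marker code
```

After `Stuffed n` the scan would resume at `n + 2`; after `Marker n code` the marker itself occupies offsets `n` and `n + 1`.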

parsing

jpeg

haskell

attoparsec