Get a `Char` from a `ByteString`

Question

Is there a way to get the first UTF-8 Char of a ByteString in O(1) time? I am looking for something like:

headUtf8 :: ByteString -> Char
tailUtf8 :: ByteString -> ByteString

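I am not yet tied to either strict or lazy ByteString, but I would prefer strict. For lazy ByteString, I can cobble something together via Text, but I am not sure how efficient it is (especially in terms of space complexity).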

import Data.ByteString.Lazy (ByteString)
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (decodeUtf8With, encodeUtf8)
import Data.Text.Encoding.Error (lenientDecode)

headUtf8 :: ByteString -> Char
headUtf8 = T.head . decodeUtf8With lenientDecode

tailUtf8 :: ByteString -> ByteString
tailUtf8 = encodeUtf8 . T.tail . decodeUtf8With lenientDecode

In case anyone is interested, this problem came up while using Alex to build a lexer that supports UTF-8 characters¹.

¹ I know that since Alex 3.0 you only need to provide alexGetByte (which is great!), but I still need to be able to get characters in other code in the lexer.

Answer 1

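The longest UTF-8 encoding is 6 bytes in the original design (RFC 3629 restricts it to 4), so if we try prefixes of 1, 2, ... bytes, the search finishes by step 6 at the latest and is therefore O(1):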

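import qualified Data.Text as Text
import qualified Data.Text.Encoding as Text
import Data.Text.Encoding (Decoding (Some))
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)

-- Partial: fails on empty input (Text.head) and on invalid UTF-8,
-- and does not terminate if the input ends in a truncated sequence.
splitUtf8 :: ByteString -> (Char, ByteString)
splitUtf8 bs = go 1
  where
    -- Try ever longer prefixes until one decodes with no leftover bytes.
    go n | BS.null slack = (Text.head t, bs')
         | otherwise = go (n + 1)
      where
        (bs1, bs') = BS.splitAt n bs
        Some t slack _ = Text.streamDecodeUtf8 bs1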

For example, here it splits a 2+3-byte ByteString:

*SO_40414452> splitUtf8 $ BS.pack[197, 145, 226, 138, 162]
('\337',"\226\138\162")

And here a 3+2-byte one:

*SO_40414452> splitUtf8 $ BS.pack[226, 138, 162, 197, 145]
('\8866',"\197\145")
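
As a variation on the same idea, the lead byte of a UTF-8 sequence already encodes its length, so the prefix search can be replaced by a direct lookup. The following is only a minimal sketch under stated assumptions: `unconsUtf8` and `leadInfo` are hypothetical names, not part of any library; it targets strict ByteString, accepts sequences of at most 4 bytes, and does not reject overlong encodings or surrogate code points.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Char (chr)
import Data.Word (Word8)

-- | Hypothetical helper: decode the first UTF-8 code point of a strict
-- ByteString and return it together with the remaining bytes.
-- Returns Nothing on empty, truncated, or malformed input.
unconsUtf8 :: ByteString -> Maybe (Char, ByteString)
unconsUtf8 bs = do
  (b0, _) <- BS.uncons bs
  -- The lead byte fixes the sequence length and its payload bits.
  (len, payload) <- leadInfo b0
  let (chunk, rest) = BS.splitAt len bs
      conts = BS.drop 1 chunk
      -- Each continuation byte contributes its low 6 bits.
      cp = BS.foldl' step payload conts
  if BS.length chunk == len && BS.all isCont conts && cp <= 0x10FFFF
    then Just (chr cp, rest)
    else Nothing
  where
    isCont b = b .&. 0xC0 == 0x80
    step acc b = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)

-- | Sequence length and initial payload for a lead byte, if it is one.
leadInfo :: Word8 -> Maybe (Int, Int)
leadInfo b
  | b < 0x80 = Just (1, fromIntegral b)
  | b .&. 0xE0 == 0xC0 = Just (2, fromIntegral (b .&. 0x1F))
  | b .&. 0xF0 == 0xE0 = Just (3, fromIntegral (b .&. 0x0F))
  | b .&. 0xF8 == 0xF0 = Just (4, fromIntegral (b .&. 0x07))
  | otherwise = Nothing
```

Since `BS.splitAt` and `BS.drop` are O(1) on strict ByteString and at most four bytes are inspected, this stays O(1), and returning `Maybe` avoids the partiality of `splitUtf8` on empty or truncated input.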

Answer 2

You want the Data.ByteString.UTF8 module from the utf8-string package. It contains an uncons function with the following signature:

uncons :: ByteString -> Maybe (Char, ByteString)

You can then define:

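import Data.ByteString (ByteString)
import Data.ByteString.UTF8 (uncons)
import Data.Maybe (fromJust)

headUtf8 :: ByteString -> Char
headUtf8 = fst . fromJust . uncons

tailUtf8 :: ByteString -> ByteString
tailUtf8 = snd . fromJust . uncons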
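
Note that `fromJust` makes both functions fail on an empty ByteString. If that matters, a total variant can keep the `Maybe` from `uncons`. This is just a sketch; the names `headUtf8Safe` and `tailUtf8Safe` are made up for illustration and are not part of utf8-string.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.UTF8 as UTF8

-- | First character, or Nothing if the ByteString is empty.
headUtf8Safe :: ByteString -> Maybe Char
headUtf8Safe = fmap fst . UTF8.uncons

-- | Everything after the first character, or Nothing if the ByteString is empty.
tailUtf8Safe :: ByteString -> Maybe ByteString
tailUtf8Safe = fmap snd . UTF8.uncons
```

In a lexer, the caller would typically pattern match on the result of `UTF8.uncons` directly rather than calling both helpers, which avoids decoding the first character twice.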
