If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

Question

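The problem I have is this: given a sequence of bytes, I want to determine its longest prefix that forms a valid Unicode character (an extended grapheme cluster), assuming UTF-8 encoding.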

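I am using Swift, so I would like to use Swift's built-in functions to do this. But those functions only decode a complete sequence of bytes. So I was thinking of converting prefixes of the byte sequence via Swift and taking the last prefix that did not fail and consists of exactly 1 character. Obviously, this might lead to trying out the entire byte sequence, which I want to avoid. One solution would be to stop trying prefixes after 4 prefixes in a row have failed. If the property asked about in my question holds, this would guarantee that all longer prefixes must also fail.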
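To illustrate why the cutoff is 4, here is a minimal sketch of the prefix-decoding idea. The helper name `decodablePrefixLengths` is made up for this example, and it assumes that Foundation's `String(bytes:encoding:)` returns `nil` for byte prefixes that end in the middle of a code point.

```swift
import Foundation

// Hypothetical helper: returns every prefix length n for which the first
// n bytes decode as valid UTF-8.
func decodablePrefixLengths(_ bytes: [UInt8]) -> [Int] {
    guard !bytes.isEmpty else { return [] }
    return (1...bytes.count).filter { n in
        String(bytes: bytes.prefix(n), encoding: .utf8) != nil
    }
}

// U+1F600 is encoded as four bytes; only the full sequence decodes.
let emoji: [UInt8] = [0xF0, 0x9F, 0x98, 0x80]
print(decodablePrefixLengths(emoji))

// "e" followed by U+0301 (combining acute accent): the prefix of length 2
// is a truncated code point, so it is skipped.
let accented: [UInt8] = [0x65, 0xCC, 0x81]
print(decodablePrefixLengths(accented))
```

Since a single code point never needs more than 4 bytes in UTF-8, a run of failures longer than that cannot be explained by a code point that is merely incomplete.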

I find the Unicode Text Segmentation Standard unreadable; otherwise I would try to implement boundary detection for extended grapheme clusters directly...

Answer 1

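After taking a long, hard look at the specification for computing the boundaries of extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules, it becomes clear that the rules for EGCs all have the shape of describing when a code point may be appended to an existing EGC to form a longer EGC. From that fact alone, the answers to my two questions follow: 1) Yes, every non-empty prefix of a sequence of code points that forms an EGC is also an EGC. 2) No, adding a code point to a valid Unicode string never decreases its length in terms of the number of EGCs it consists of.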
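The first property can be spot-checked on concrete strings. The following is only a sketch: the function name `everyScalarPrefixIsOneCluster` is invented for this example, and it relies on `String.count` counting extended grapheme clusters, as it does in current Swift versions. A check like this tests individual inputs; it does not prove the general claim.

```swift
// Hypothetical check: builds each non-empty prefix of the string's Unicode
// scalars and verifies that it is counted as exactly one Character.
func everyScalarPrefixIsOneCluster(_ text: String) -> Bool {
    var prefix = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        prefix.append(scalar)
        if String(prefix).count != 1 { return false }
    }
    return true
}

// "e" + combining acute accent: both prefixes are single clusters.
print(everyScalarPrefixIsOneCluster("e\u{301}"))

// A flag made of two regional indicator symbols (D + E).
print(everyScalarPrefixIsOneCluster("\u{1F1E9}\u{1F1EA}"))

// "ab" is two clusters, so the check fails at the second scalar.
print(everyScalarPrefixIsOneCluster("ab"))
```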

So, given this, the following Swift code extracts the longest Unicode character from the start of a byte sequence (or returns nil if there is no valid Unicode character there):

    import Foundation  // provides String(bytes:encoding:)

    func lex<S: Sequence>(_ input: S) -> (length: Int, out: Character)? where S.Element == UInt8 {
        // This code works under three assumptions, all of which are true:
        // 1) If a sequence of code points does not form a valid character, then appending code points to it does not yield a valid character
        // 2) Appending code points to a sequence of code points does not decrease its length in terms of extended grapheme clusters
        // 3) A code point takes up at most 4 bytes in UTF-8
        var chars: [UInt8] = []   // bytes consumed so far
        var result: String = ""   // longest prefix so far that decodes to exactly one character
        var resultLength = 0      // byte length of that prefix
        func value() -> (length: Int, out: Character)? {
            guard let character = result.first else { return nil }
            return (length: resultLength, out: character)
        }
        var length = 0
        var iterator = input.makeIterator()
        // Keep reading while at most 4 bytes lie beyond the last good prefix.
        while length - resultLength <= 4 {
            guard let char = iterator.next() else { return value() }
            chars.append(char)
            length += 1
            // nil means the bytes so far are not valid UTF-8, e.g. a truncated code point.
            guard let s = String(bytes: chars, encoding: .utf8) else { continue }
            // More than one character: the previous result was the longest single character.
            guard s.count == 1 else { return value() }
            result = s
            resultLength = length
        }
        return value()
    }

unicode

swift