Why is Swift counting this Grapheme Cluster as two characters instead of one?

Question

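Swift is generally really smart about counting a grapheme cluster as a single character. For example, if I want to make a Lebanese flag, I can combine the two Unicode characters

U+1F1F1 REGIONAL INDICATOR SYMBOL LETTER L
U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B

and, as expected, the result is one character in Swift: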

let s = "\u{1f1f1}\u{1f1e7}"
assert(s.characters.count == 1)
assert(s.utf16.count == 4)
assert(s.utf8.count == 8)

However, let's say I want to make a Bicyclist emoji of Fitzpatrick Type-5. If I combine

U+1F6B4 BICYCLIST
U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5

Swift counts this combination as two characters!

let s = "\u{1f6b4}\u{1f3fe}"
assert(s.characters.count == 2)   // <----- WHY?
assert(s.utf16.count == 4)
assert(s.utf8.count == 8)

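Why is this two characters instead of one?

To show why I would expect it to be 1, note that this cluster is actually rendered as a single valid emoji (a cyclist with the Type-5 skin tone).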

Answer 1

This is only a partial answer, expanding on the bug report mentioned in emrys57's comment. When splitting a Unicode string into "characters", Swift apparently uses the Grapheme Cluster Boundaries defined in UAX #29 Unicode Text Segmentation. There's a rule not to break between regional indicator symbols, but there is no such rule for Emoji modifiers. So, according to UAX #29, the string "\u{1f6b4}\u{1f3fe}" contains two grapheme clusters. See this message from Ken Whistler on the Unicode mailing list for an explanation:

This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. [...] You need additional, specific knowledge about these sequences -- it doesn't just fall out from a default implementation of UAX #29 rules for grapheme clusters.
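As a rough illustration of the "additional, specific knowledge" the quote refers to, here is a minimal sketch that first inspects the string at the Unicode scalar level and then counts characters while treating a standalone emoji modifier as part of the character before it. The names `isEmojiModifier` and `visibleCharacterCount` are hypothetical helpers invented for this example, not standard library API, and the code assumes Swift 4 or later (where `String` is a collection of `Character`). Later revisions of UAX #29 added rules for emoji modifiers, so the built-in count may already differ depending on the Swift and Unicode versions in use; the helper is written to give the same answer either way.

```swift
// The cyclist string from the question.
let cyclist = "\u{1f6b4}\u{1f3fe}"

// Inspect the string as Unicode scalars: there are always exactly two,
// regardless of how they are grouped into grapheme clusters.
let scalarValues = cyclist.unicodeScalars.map {
    String($0.value, radix: 16, uppercase: true)
}

// Hypothetical helper: the five emoji modifiers occupy
// U+1F3FB (FITZPATRICK TYPE-1-2) through U+1F3FF (FITZPATRICK TYPE-6).
let emojiModifierRange: ClosedRange<UInt32> = 0x1F3FB...0x1F3FF

func isEmojiModifier(_ scalar: Unicode.Scalar) -> Bool {
    return emojiModifierRange.contains(scalar.value)
}

// Hypothetical helper: counts characters, but does not count a character
// that consists solely of an emoji modifier when it follows another
// character. This is a simplification: it does not check that the
// preceding character is actually a valid emoji modifier base.
func visibleCharacterCount(_ s: String) -> Int {
    var count = 0
    for character in s {
        let scalars = Array(String(character).unicodeScalars)
        let isLoneModifier = scalars.count == 1 && isEmojiModifier(scalars[0])
        if isLoneModifier && count > 0 {
            continue  // attach the modifier to the previous character
        }
        count += 1
    }
    return count
}
```

With these helpers, `visibleCharacterCount(cyclist)` yields 1 whether or not the standard library already merges the modifier into the preceding cluster, while `cyclist.unicodeScalars.count` stays at 2.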

unicode

emoji

grapheme

swift