Why is Swift counting this Grapheme Cluster as two characters instead of one?
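Usually Swift is really smart about counting a grapheme cluster as a single character. If I want to make a Lebanese flag, for example, I can combine the two Unicode characters
- U+1F1F1 REGIONAL INDICATOR SYMBOL LETTER L
- U+1F1E7 REGIONAL INDICATOR SYMBOL LETTER B
and, as expected, this is one character in Swift: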
let s = "\u{1f1f1}\u{1f1e7}"
assert(s.characters.count == 1)
assert(s.utf16.count == 4)
assert(s.utf8.count == 8)
However, let's say I want to make a Bicyclist emoji of Fitzpatrick Type-5. If I combine
- U+1F6B4 BICYCLIST
- U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
Swift counts this combination as two characters!
let s = "\u{1f6b4}\u{1f3fe}"
assert(s.characters.count == 2) // <----- WHY?
assert(s.utf16.count == 4)
assert(s.utf8.count == 8)
Why is this two characters instead of one?
To show why I would expect it to be 1, note that this sequence is actually rendered as a single valid emoji.
Part of the answer is given in the bug report mentioned in emrys57's comment. When splitting a Unicode string into "characters", Swift apparently uses the Grapheme Cluster Boundaries defined in UAX #29 Unicode Text Segmentation. There's a rule not to break between regional indicator symbols, but there is no such rule for Emoji modifiers. So, according to UAX #29, the string "\u{1f6b4}\u{1f3fe}" contains two grapheme clusters. See this message from Ken Whistler on the Unicode mailing list for an explanation:
This results from the fact that the fallback behavior for the modifiers is
simply as independent pictographic blorts, i.e. the color swatch images. [...] You need additional, specific
knowledge about these sequences -- it doesn't just fall out from a
default implementation of UAX #29 rules for grapheme clusters.
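To make the point concrete, here is a minimal sketch (assuming Swift 3 or later syntax; `scalarHex` is a hypothetical helper, not a standard library function) that lists the Unicode scalars of both strings from the question. Each string is made of exactly two scalars, so the difference in character count does not come from the encoding. It comes only from whether the segmentation rules in use join the pair into one grapheme cluster.

```swift
// The two strings from the question, written as escaped scalars.
let flag = "\u{1f1f1}\u{1f1e7}"      // REGIONAL INDICATOR L + B
let cyclist = "\u{1f6b4}\u{1f3fe}"   // BICYCLIST + FITZPATRICK TYPE-5

// Hypothetical helper: hex code points of a string's Unicode scalars.
func scalarHex(_ text: String) -> [String] {
    return text.unicodeScalars.map { String($0.value, radix: 16) }
}

print(scalarHex(flag))     // ["1f1f1", "1f1e7"]
print(scalarHex(cyclist))  // ["1f6b4", "1f3fe"]
```

The character count itself is deliberately left out of the sketch, because it depends on which version of the grapheme cluster rules the Swift runtime implements.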