Normalizing (composing and decomposing) UTF-8 strings in Swift

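Accented characters in Unicode strings can be represented in a "short" (composed) form or a "long" (decomposed) form. As a result, in Xcode the UTF-8 encoding of string a is 8 bytes long while that of string b is 10 bytes long, even though the two look identical: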

let a:String = "δέκα" // 8 bytes
print(a.data(using:String.Encoding.utf8)!.count)

let b:String = "δέκα" // 10 bytes
print(b.data(using:String.Encoding.utf8)!.count)

I need to "shrink" the strings so that they are always in the shorter "composed" form. How is this done in Swift?


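Footnote: I know that the accents can be stripped entirely, as shown below. I don't want to do that; I only want to "compose" the characters.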

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)

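I'm aware of the .widthInsensitive option, but the documentation seems to suggest that it only applies to Asian characters. So, specifically, this does not compose or decompose the characters: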

let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)

Update

Here is a second, longer version of the code that shows the byte differences, for clarity.

let a:String = String(bytes:[206, 180, 206, 173, 206, 186, 206, 177], encoding:.utf8)!
print(a, a.data(using:String.Encoding.utf8)!.count)

let b:String = String(bytes:[206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding:.utf8)!
print(b, b.data(using:String.Encoding.utf8)!.count)

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
print(out.data(using:String.Encoding.utf8)!.count)

Thanks to @matt for pointing to CFStringNormalize(_:_:).

Here is how you can do it:

import Foundation
import CoreFoundation

extension String {
    func normalizedCanonicallyComposed() -> String {
        let mutable = NSMutableString(string: self) as CFMutableString
        CFStringNormalize(mutable, .C) // .C = canonical composition (NFC); use .KC for compatibility composition (NFKC)
        return mutable as String
    }
}

Usage

let a: String = String(bytes: [206, 180, 206, 173, 206, 186, 206, 177], encoding: .utf8)!
print(a, a.data(using: .utf8)!.count)

let b: String = String(bytes: [206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding: .utf8)!
print(b, b.data(using: .utf8)!.count)
        
print("Before - \(b), count: \(b.data(using: .utf8)!.count)")
let c = b.normalizedCanonicallyComposed()
print("After - \(c), count: \(c.data(using: .utf8)!.count)")

Output

δέκα 8
δέκα 10
Before - δέκα, count: 10
After - δέκα, count: 8
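
The two extra bytes in the "Before" string come from a separate combining accent. One way to confirm this is to list the Unicode scalars of both strings. The following is only a minimal sketch that reuses the byte sequences above; the helper name scalarDump is made up for illustration and is not part of any library.

```swift
// Hypothetical helper: renders each Unicode scalar of a string as U+XXXX.
func scalarDump(_ s: String) -> String {
    s.unicodeScalars.map { scalar -> String in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }.joined(separator: " ")
}

// Same byte sequences as in the usage example above.
let composedBytes: [UInt8] = [206, 180, 206, 173, 206, 186, 206, 177]
let decomposedBytes: [UInt8] = [206, 180, 206, 181, 204, 129, 206, 186, 206, 177]

let a = String(decoding: composedBytes, as: UTF8.self)
let b = String(decoding: decomposedBytes, as: UTF8.self)

print(scalarDump(a)) // U+03B4 U+03AD U+03BA U+03B1
print(scalarDump(b)) // U+03B4 U+03B5 U+0301 U+03BA U+03B1
```

U+0301 is the combining acute accent. In the composed string it is merged with U+03B5 into the single scalar U+03AD.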

The precomposedStringWithCanonicalMapping property (available once Foundation is imported) performs the normalization:

import Foundation

let a = "δέκα"
print(a, Data(a.utf8).count) // δέκα 8

let b = "δε\u{0301}κα"
print(b, Data(b.utf8).count) // δέκα 10

let bn = b.precomposedStringWithCanonicalMapping
print(bn, Data(bn.utf8).count) // δέκα 8

A "literal" comparison shows that a and bn are identical, whereas b differs from a:

print(b.compare(a, options: .literal) == .orderedSame)  // false
print(bn.compare(a, options: .literal) == .orderedSame) // true
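
The .literal option matters here because Swift's own == on String compares by canonical equivalence, so it treats the composed and decomposed forms as equal. A small sketch of that behaviour follows; it writes the two strings with explicit scalar escapes (my own choice, so that the difference is visible in the source) instead of the variables above.

```swift
let composed = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}"            // precomposed έ (U+03AD)
let decomposed = "\u{03B4}\u{03B5}\u{0301}\u{03BA}\u{03B1}"  // ε followed by combining acute (U+0301)

// String equality uses canonical equivalence, not the stored bytes.
print(composed == decomposed)                     // true

// Both consist of four user-perceived characters.
print(composed.count, decomposed.count)           // 4 4

// The encoded forms still differ.
print(composed.utf8.count, decomposed.utf8.count) // 8 10
print(composed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)) // false
```

So normalization is mainly needed when the bytes themselves matter, for example when storing, hashing outside of Swift, or comparing with .literal.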

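Remarks: precomposedStringWithCanonicalMapping produces "Unicode Normalization Form C". There is also precomposedStringWithCompatibilityMapping, which produces "Unicode Normalization Form KC". See the Unicode standard for the precise differences. Roughly speaking, the latter folds more differences that are "inappropriately distinguished in many circumstances". Examples: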

let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
// ﬁ ﬁ fi

let d = "2\u{2075}"
print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
// 2⁵ 2⁵ 25

let e = "\u{2165}" // ROMAN NUMERAL SIX
print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
// Ⅵ Ⅵ VI
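
For the opposite direction (decomposing), Foundation also provides decomposedStringWithCanonicalMapping (Normalization Form D) and decomposedStringWithCompatibilityMapping (Normalization Form KD). The sketch below assumes the same word as above, written with scalar escapes; the variable names are only illustrative.

```swift
import Foundation

let composed = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}" // "δέκα" with precomposed έ

// Canonical decomposition splits έ into ε plus a combining acute accent.
let nfd = composed.decomposedStringWithCanonicalMapping
print(composed.utf8.count, nfd.utf8.count) // 8 10

// Composing again restores the original scalar sequence.
let roundTrip = nfd.precomposedStringWithCanonicalMapping
print(roundTrip.unicodeScalars.elementsEqual(composed.unicodeScalars)) // true
```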