Normalizing (composing and decomposing) UTF-8 strings in Swift

Question

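Accented characters in a Unicode string can be represented in a "short" (composed) or a "long" (decomposed) form. As a result, in Xcode the UTF-8 encoding of string a is 8 bytes long while that of string b is 10 bytes long, even though the two strings look identical: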

let a:String = "δέκα" // 8 bytes
print(a.data(using:String.Encoding.utf8)!.count)

let b:String = "δέκα" // 10 bytes
print(b.data(using:String.Encoding.utf8)!.count)

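I need to "shrink" the strings so that they always use the shorter, composed form. How can I do that in Swift?

Footnote: I know the accents can be stripped entirely, as shown below. I don't want that; I only want to compose the characters.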

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)

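I am aware of the .widthInsensitive option, but the documentation seems to indicate that it only applies to East Asian characters. So, specifically, this does not compose or decompose characters: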

let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)

Update

Here is a second, longer version of the code that shows the byte difference explicitly, for clarity.

let a:String = String(bytes:[206, 180, 206, 173, 206, 186, 206, 177], encoding:.utf8)!
print(a, a.data(using:String.Encoding.utf8)!.count)

let b:String = String(bytes:[206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding:.utf8)!
print(b, b.data(using:String.Encoding.utf8)!.count)

let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
print(out.data(using:String.Encoding.utf8)!.count)
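
To make the difference between the two byte sequences concrete, here is a minimal sketch that lists the Unicode scalars of each string. It assumes the same byte arrays as above; scalarHex is a hypothetical helper name chosen for illustration, not an existing API.

```swift
// Decode the same byte sequences used above. String(decoding:as:) never fails;
// it would repair invalid UTF-8 instead of returning nil.
let composed = String(decoding: [206, 180, 206, 173, 206, 186, 206, 177] as [UInt8], as: UTF8.self)
let decomposed = String(decoding: [206, 180, 206, 181, 204, 129, 206, 186, 206, 177] as [UInt8], as: UTF8.self)

// Hypothetical helper: the Unicode scalars of a string as uppercase hex code points.
func scalarHex(_ s: String) -> [String] {
    return s.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

print(scalarHex(composed))    // ["3B4", "3AD", "3BA", "3B1"]
print(scalarHex(decomposed))  // ["3B4", "3B5", "301", "3BA", "3B1"]

// Same number of user-perceived characters, different number of UTF-8 bytes.
print(composed.count, decomposed.count)            // 4 4
print(composed.utf8.count, decomposed.utf8.count)  // 8 10
```

The composed string stores the accented letter as the single scalar U+03AD, while the decomposed string stores U+03B5 followed by the combining acute accent U+0301, which accounts for the two extra bytes.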

Answer 1

Thanks to @matt for pointing to CFStringNormalize(_:_:).

Here is how you can do it:

import Foundation
import CoreFoundation

extension String {
    func normalizedCanonicallyComposed() -> String {
        let mutable = NSMutableString(string: self) as CFMutableString
        CFStringNormalize(mutable, .C) // .C is canonical composition (NFC); .KC would also apply compatibility mappings
        return mutable as String
    }
}

Usage

let a: String = String(bytes: [206, 180, 206, 173, 206, 186, 206, 177], encoding: .utf8)!
print(a, a.data(using: .utf8)!.count)

let b: String = String(bytes: [206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding: .utf8)!
print(b, b.data(using: .utf8)!.count)
        
print("Before - \(b), count: \(b.data(using: .utf8)!.count)")
let c = b.normalizedCanonicallyComposed()
print("After - \(c), count: \(c.data(using: .utf8)!.count)")

Output

δέκα 8
δέκα 10
Before - δέκα, count: 10
After - δέκα, count: 8

Answer 2

You can use precomposedStringWithCanonicalMapping to normalize the string:

let a = "δέκα"
print(a, Data(a.utf8).count) // δέκα 8

let b = "δε\u{0301}κα"
print(b, Data(b.utf8).count) // δέκα 10

let bn = b.precomposedStringWithCanonicalMapping
print(bn, Data(bn.utf8).count) // δέκα 8

A "literal" comparison shows that a is identical to bn, but not to b:

print(b.compare(a, options: .literal) == .orderedSame)  // false
print(bn.compare(a, options: .literal) == .orderedSame) // true
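
The literal comparison matters because Swift's own == on String compares by canonical equivalence, so it does not reveal which form a string is in. A minimal sketch, assuming the same two spellings as above written with explicit escapes so that the stored form is unambiguous:

```swift
import Foundation

let a = "δ\u{03AD}κα"    // composed: U+03AD GREEK SMALL LETTER EPSILON WITH TONOS
let b = "δε\u{0301}κα"   // decomposed: U+03B5 followed by U+0301 COMBINING ACUTE ACCENT

// String equality treats canonically equivalent strings as equal.
print(a == b)                                            // true

// Comparing the underlying code units or scalars exposes the difference.
print(a.utf8.elementsEqual(b.utf8))                      // false
print(a.unicodeScalars.elementsEqual(b.unicodeScalars))  // false

// After normalizing b to NFC the bytes match as well.
let bn = b.precomposedStringWithCanonicalMapping
print(bn.utf8.elementsEqual(a.utf8))                     // true
```

So normalization is only needed when the exact bytes matter, for example when writing to a file, hashing, or sending the data to a system that compares bytes.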

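Remark: precomposedStringWithCanonicalMapping produces "Unicode Normalization Form C". There is also precomposedStringWithCompatibilityMapping, which produces "Unicode Normalization Form KC". See

1.2 Normalization Forms

in the Unicode Standard for the exact differences. Roughly speaking, the latter folds more differences that are "inappropriately distinguished in many circumstances". Examples: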

let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
// ﬁ ﬁ fi

let d = "2\u{2075}"
print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
// 2⁵ 2⁵ 25

let e = "\u{2165}" // ROMAN NUMERAL SIX
print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
// Ⅵ Ⅵ VI
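
For the opposite direction, Foundation also offers decomposedStringWithCanonicalMapping (Normalization Form D). The following is a minimal sketch of a round trip between the two canonical forms; isNFC is a hypothetical helper name introduced here for illustration, not an existing API.

```swift
import Foundation

extension String {
    // Hypothetical helper: true if the string's UTF-8 bytes are unchanged by NFC normalization.
    var isNFC: Bool {
        return utf8.elementsEqual(precomposedStringWithCanonicalMapping.utf8)
    }
}

let composed = "δ\u{03AD}κα"                              // 4 scalars, 8 bytes
let nfd = composed.decomposedStringWithCanonicalMapping   // Form D: splits U+03AD into U+03B5 + U+0301
let nfc = nfd.precomposedStringWithCanonicalMapping       // Form C: recombines them

print(composed.utf8.count, nfd.utf8.count, nfc.utf8.count) // 8 10 8
print(composed.isNFC, nfd.isNFC, nfc.isNFC)                // true false true
```

Comparing bytes before and after normalization is a simple way to detect whether a string needs to be composed at all, at the cost of performing the normalization once.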
