Normalizing (composing and decomposing) UTF-8 strings in Swift
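Accented characters in Unicode strings can be represented in a "short" (composed) or a "long" (decomposed) form. This means that in Xcode, string a is 8 bytes long while string b is 10 bytes long, even though they look identical: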
let a:String = "δέκα" // 8 bytes
print(a.data(using:String.Encoding.utf8)!.count)
let b:String = "δέκα" // 10 bytes
print(b.data(using:String.Encoding.utf8)!.count)
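Because the two literals render identically, it can help to build them from explicit Unicode scalars. The following is a minimal sketch; the names composed and decomposed are illustrative and not from the original post.

```swift
// Composed form: δ, έ (U+03AD, a single precomposed scalar), κ, α
let composed = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}"
// Decomposed form: δ, ε (U+03B5) + combining acute accent (U+0301), κ, α
let decomposed = "\u{03B4}\u{03B5}\u{0301}\u{03BA}\u{03B1}"

print(composed.utf8.count, decomposed.utf8.count)                     // 8 10
print(composed.unicodeScalars.count, decomposed.unicodeScalars.count) // 4 5
print(composed.count, decomposed.count)                               // 4 4
print(composed == decomposed)                                         // true
```

Swift's == compares strings by canonical equivalence, so the two values compare equal even though their UTF-8 bytes differ. The difference only shows up once the bytes or scalars are inspected.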
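I need to "shrink" the strings so that they are always in the shorter, "composed" form. How is this done in Swift?
Footnote: I know it is possible to strip the accents entirely, as shown below. I don't want to do that; I only want to "compose" the characters.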
let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = "δέκα".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: usPosixLocale)
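I am aware of the .widthInsensitive option, but the documentation seems to indicate that it applies only to Asian characters. So, specifically, this does not compose or decompose characters: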
let out = a.folding(options: [.widthInsensitive], locale: usPosixLocale)
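For contrast, here is a hedged sketch of the kind of difference .widthInsensitive is meant to fold: full-width versus ordinary-width forms. The sample string is an assumption chosen for illustration.

```swift
import Foundation

// Full-width Latin capital letters A, B, C (U+FF21...U+FF23),
// as used in East Asian typography.
let fullWidth = "\u{FF21}\u{FF22}\u{FF23}"

// Folding with .widthInsensitive is expected to map them to ordinary-width letters.
let folded = fullWidth.folding(options: [.widthInsensitive], locale: nil)

print(fullWidth.utf8.count, folded.utf8.count)
print(folded)
```

This is a width mapping, not a normalization form, which is why it is not the right tool for composing accented characters.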
Update
Here is a second, longer version of the code that shows the byte difference explicitly.
let a:String = String(bytes:[206, 180, 206, 173, 206, 186, 206, 177], encoding:.utf8)!
print(a, a.data(using:String.Encoding.utf8)!.count)
let b:String = String(bytes:[206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding:.utf8)!
print(b, b.data(using:String.Encoding.utf8)!.count)
let usPosixLocale = Locale(identifier: "en_US_POSIX")
let out = b.folding(options: [.widthInsensitive], locale: usPosixLocale)
print(out.data(using:String.Encoding.utf8)!.count)
Thanks to @matt for pointing to CFStringNormalize(_:_:).
Here is how you can do it:
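import Foundation
import CoreFoundation

extension String {
    /// Returns the string in Unicode Normalization Form C (canonical composition).
    func normalizedCanonicallyComposed() -> String {
        // Bridge to a mutable Core Foundation string; CFStringNormalize works in place.
        let mutable = NSMutableString(string: self) as CFMutableString
        CFStringNormalize(mutable, .C) // use .KC for compatibility composition (NFKC)
        return mutable as String
    }
}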
Usage
let a: String = String(bytes: [206, 180, 206, 173, 206, 186, 206, 177], encoding: .utf8)!
print(a, a.data(using: .utf8)!.count)
let b: String = String(bytes: [206, 180, 206, 181, 204, 129, 206, 186, 206, 177], encoding: .utf8)!
print(b, b.data(using: .utf8)!.count)
print("Before - \(b), count: \(b.data(using: .utf8)!.count)")
let c = b.normalizedCanonicallyComposed()
print("After - \(c), count: \(c.data(using: .utf8)!.count)")
Output
δέκα 8
δέκα 10
Before - δέκα, count: 10
After - δέκα, count: 8
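The same approach can be generalized to any normalization form by passing the form as a parameter. This is a sketch only: the method name normalized(_:) is hypothetical, and the code assumes an Apple platform, where NSMutableString bridges to CFMutableString.

```swift
import Foundation
import CoreFoundation

extension String {
    // Hypothetical helper: normalizes a copy of the string to the given form
    // (.C = NFC, .D = NFD, .KC = NFKC, .KD = NFKD).
    func normalized(_ form: CFStringNormalizationForm) -> String {
        let mutable = NSMutableString(string: self) as CFMutableString
        CFStringNormalize(mutable, form)
        return mutable as String
    }
}

let decomposedInput = "\u{03B4}\u{03B5}\u{0301}\u{03BA}\u{03B1}" // 10 bytes in UTF-8
let nfc = decomposedInput.normalized(.C)
let nfd = nfc.normalized(.D)

print(nfc.utf8.count) // 8
print(nfd.utf8.count) // 10
```

Passing .D reverses the composition, which covers the "decomposing" half of the question.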
You can use precomposedStringWithCanonicalMapping (available on String once Foundation is imported) to normalize:
let a = "δέκα"
print(a, Data(a.utf8).count) // δέκα 8
let b = "δε\u{0301}κα"
print(b, Data(b.utf8).count) // δέκα 10
let bn = b.precomposedStringWithCanonicalMapping
print(bn, Data(bn.utf8).count) // δέκα 8
A "literal" comparison shows that a is identical to bn, but different from b:
print(b.compare(a, options: .literal) == .orderedSame) // false
print(bn.compare(a, options: .literal) == .orderedSame) // true
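For the opposite direction, Foundation also provides decomposedStringWithCanonicalMapping, which produces Normalization Form D. A minimal sketch of a round trip, with illustrative variable names:

```swift
import Foundation

let composed = "\u{03B4}\u{03AD}\u{03BA}\u{03B1}" // δέκα with precomposed έ (U+03AD)

// NFD splits έ into ε (U+03B5) followed by the combining acute accent (U+0301).
let decomposed = composed.decomposedStringWithCanonicalMapping
print(decomposed.utf8.count)           // 10
print(decomposed.unicodeScalars.count) // 5

// Composing again restores the original bytes.
let roundTripped = decomposed.precomposedStringWithCanonicalMapping
print(roundTripped.utf8.elementsEqual(composed.utf8)) // true
```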
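Remark: precomposedStringWithCanonicalMapping produces "Unicode Normalization Form C". There is also precomposedStringWithCompatibilityMapping, which produces "Unicode Normalization Form KC". See the Unicode standard for the exact differences. Roughly speaking, the latter also folds differences that are "inappropriately distinguished in many circumstances". Examples: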
let c = "\u{fb01}" // LATIN SMALL LIGATURE FI
print(c, c.precomposedStringWithCanonicalMapping, c.precomposedStringWithCompatibilityMapping)
// ﬁ ﬁ fi
let d = "2\u{2075}"
print(d, d.precomposedStringWithCanonicalMapping, d.precomposedStringWithCompatibilityMapping)
// 2⁵ 2⁵ 25
let e = "\u{2165}" // ROMAN NUMERAL SIX
print(e, e.precomposedStringWithCanonicalMapping, e.precomposedStringWithCompatibilityMapping)
// Ⅵ Ⅵ VI
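One practical consequence of these examples is that Form KC is lossy: it can make strings equal that Swift otherwise treats as distinct. A small sketch based on the superscript example above, with an illustrative variable name:

```swift
import Foundation

let power = "2\u{2075}" // "2" followed by SUPERSCRIPT FIVE

// Swift's == uses canonical equivalence, which keeps the superscript distinct.
print(power == "25")                                           // false
// Form C leaves the superscript untouched.
print(power.precomposedStringWithCanonicalMapping == "25")     // false
// Form KC replaces it with an ordinary digit.
print(power.precomposedStringWithCompatibilityMapping == "25") // true
```

For the original goal of merely composing accents without changing meaning, Form C is therefore the more conservative choice.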