Strange Behavior In CharacterSet.contains() Method, With High UTF8 Characters Mixed With ASCII

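Here's the situation: I'm creating a StringProtocol extension that adds the ability to split a string on a character set (any character in the set acts as a delimiter, with greedy comparison).

The problem is that I run into trouble when comparing against a character set that contains both a few ASCII characters and several high UTF-8 characters.

If I supply only high UTF-8 characters, or only ASCII characters, the match works fine.

I created a playground to illustrate this.

The odd result is the second-to-last printout ("Test String 2 does not have a space or a joker."). That should say that it does have one.

The issue is that the space in the CharacterSet matches, but the joker card does not.

Any ideas? Here's the playground: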

import Foundation

public extension StringProtocol {
    func containsOneOfThese(_ inCharacterset: CharacterSet) -> Bool {
        self.contains { (char) in
            char.unicodeScalars.contains { (scalar) in inCharacterset.contains(scalar) }
        }
    }
}

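let space = " "
let joker = "🃏"   // U+1F0CF PLAYING CARD BLACK JOKER (outside the BMP)
let both = space + joker

let spadesNumberCards = "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪"   // U+1F0A1 through U+1F0AA
let spadesFaceCards = "🂫🂬🂭🂮"               // U+1F0AB through U+1F0AE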

let testString1 = spadesNumberCards + space + spadesFaceCards
let testString2 = spadesNumberCards + joker + spadesFaceCards
let testString3 = spadesNumberCards + both + spadesFaceCards

print("These Are The Strings We Are Testing:\n")
print("Test String 1: \"\(testString1)\"")
print("Test String 2: \"\(testString2)\"")
print("Test String 3: \"\(testString3)\"")
      
print("\nFirst, See If Any Of the Strings Contain Spaces:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")

print("\nNext, See If Any Of the Strings Contain Jokers:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")

print("\nOK, Now it gets weird:\n")

print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")

This prints:

These Are The Strings We Are Testing:

Test String 1: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🂫🂬🂭🂮"
Test String 2: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪🃏🂫🂬🂭🂮"
Test String 3: "🂡🂢🂣🂤🂥🂦🂧🂨🂩🂪 🃏🂫🂬🂭🂮"

First, See If Any Of the Strings Contain Spaces:

Test String 1 does have a space.
Test String 2 does not have a space.
Test String 3 does have a space.

Next, See If Any Of the Strings Contain Jokers:

Test String 1 does not have a joker.
Test String 2 does have a joker.
Test String 3 does have a joker.

OK, Now it gets weird:

Test String 1 does have a space or a joker.
Test String 2 does not have a space or a joker.
Test String 3 does have a space or a joker.

CharacterSet.init(charactersIn string: String) does not seem to work correctly if the string contains characters both inside and outside the BMP (Basic Multilingual Plane):

let s = " 🃏"
let cs = CharacterSet(charactersIn: s)
s.unicodeScalars.forEach {
    print(cs.contains($0))
}

// Expected output: true, true
// Actual output:   true, false

A workaround is to create the character set from the sequence of Unicode scalars instead:

let cs = CharacterSet(s.unicodeScalars)

This produces the expected output.
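Applied to the extension from the question, the workaround might look like the sketch below. The `containsOneOfThese(charactersOf:)` overload is a hypothetical name chosen for illustration, not part of Foundation. It takes the delimiter string directly and builds the set from that string's Unicode scalars, so CharacterSet(charactersIn:) is never called. The test string here is a shortened stand-in for the one in the playground.

```swift
import Foundation

public extension StringProtocol {
    /// Hypothetical helper: true if any Unicode scalar of `self`
    /// also occurs in `inString`.
    func containsOneOfThese(charactersOf inString: String) -> Bool {
        // Build the set from the scalar sequence (the SetAlgebra initializer)
        // instead of going through CharacterSet(charactersIn:).
        let characterSet = CharacterSet(inString.unicodeScalars)
        return unicodeScalars.contains { characterSet.contains($0) }
    }
}

let joker = "\u{1F0CF}"      // PLAYING CARD BLACK JOKER, outside the BMP
let both = " " + joker       // one ASCII scalar plus one non-BMP scalar

// Shortened stand-in for testString2: ace of spades, joker, jack of spades.
let testString2 = "\u{1F0A1}" + joker + "\u{1F0AB}"

print(testString2.containsOneOfThese(charactersOf: both))
```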

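Note, however, that this cannot handle the full range of Swift Characters (which include grapheme clusters made up of multiple Unicode scalars). You may therefore want to use a Set<Character> instead.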
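A minimal sketch of the Set<Character> approach follows. The overload name, the `delimiters` set, and the sample strings are assumptions made for illustration. Because membership is tested per Character, a grapheme cluster built from several scalars (the two-scalar flag here) is matched as a single unit.

```swift
public extension StringProtocol {
    /// Hypothetical helper: true if any Character of `self`
    /// is a member of `inCharacters`.
    func containsOneOfThese(_ inCharacters: Set<Character>) -> Bool {
        contains { inCharacters.contains($0) }
    }
}

let joker: Character = "\u{1F0CF}"           // PLAYING CARD BLACK JOKER
let flag: Character = "\u{1F1E9}\u{1F1EA}"   // one Character made of two scalars
let delimiters: Set<Character> = [" ", joker, flag]

// Ace of spades, joker, jack of spades.
let testString2 = "\u{1F0A1}\u{1F0CF}\u{1F0AB}"

print(testString2.containsOneOfThese(delimiters))   // true
print("abc".containsOneOfThese(delimiters))         // false

// The same set also covers the splitting use case from the question.
// split(whereSeparator:) omits empty subsequences by default, so runs of
// adjacent delimiters collapse into a single break.
let pieces = "a b\u{1F0CF}c".split { delimiters.contains($0) }
print(pieces)                                       // ["a", "b", "c"]
```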