What is the best way to test if a CharacterSet contains a Character in Swift 4?

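I'm looking for a way, in Swift 4, to test whether a Character is a member of an arbitrary CharacterSet. I have this Scanner class that will be used for some lightweight parsing. One of the functions in the class is to skip any characters, at the current position, that belong to a certain set of possible characters.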

class MyScanner {
  let str: String
  var idx: String.Index
  init(_ string: String) {
    str = string
    idx = str.startIndex
  }
  var remains: String { return String(str[idx..<str.endIndex])}

  func skip(charactersIn characters: CharacterSet) {
    // Does not compile as written: contains(_:) expects a Unicode.Scalar, not a Character.
    while idx < str.endIndex && characters.contains(str[idx]) {
      idx = str.index(idx, offsetBy: 1)
    }
  }
}

let scanner = MyScanner("fizz   buzz fizz")
scanner.skip(charactersIn: CharacterSet.alphanumerics)
scanner.skip(charactersIn: CharacterSet.whitespaces)
print("what remains: \"\(scanner.remains)\"")

I would like to implement the skip(charactersIn:) function so that the code above prints buzz fizz.

The tricky part is characters.contains(str[idx]) in the while condition: .contains() requires a Unicode.Scalar, and I'm at a loss as to what the next step is.

I know I could pass a String to the skip function, but I'd like to find a way to make it work with a CharacterSet, because of all the convenient static members (alphanumerics, whitespaces, etc.).

How do you test whether a CharacterSet contains a Character?

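Not sure if it's the most efficient way, but you can create a new CharacterSet and check whether one is a subset/superset of the other (set comparison is rather quick):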

let newSet = CharacterSet(charactersIn: "a")
// let newSet = CharacterSet(charactersIn: "\(character)")
print(newSet.isSubset(of: CharacterSet.decimalDigits)) // false
print(newSet.isSubset(of: CharacterSet.alphanumerics)) // true
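
If you want to reuse that check, it could be wrapped in a small helper. This is only a sketch: `isMember(_:of:)` is a name made up for illustration, not a Foundation API, and it builds a temporary one-character set on every call.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): tests membership by building
// a one-character CharacterSet and checking that it is a subset of `set`.
func isMember(_ character: Character, of set: CharacterSet) -> Bool {
    let single = CharacterSet(charactersIn: String(character))
    return single.isSubset(of: set)
}

print(isMember("a", of: .alphanumerics))  // true
print(isMember("a", of: .decimalDigits))  // false
```

Because the temporary set is built from the character's unicode scalars, a character made of several scalars counts as a member only when each of its scalars is in the target set.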

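I know that you wanted to use CharacterSet rather than String, but CharacterSet does not (yet, at least) support characters that are composed of more than one Unicode.Scalar. See the "family" character (👩‍👩‍👧‍👦) or the international flag characters (e.g. "🇯🇵" or "🇯🇲") that Apple demonstrated in the string discussion in the WWDC 2017 video What's New in Swift. Emoji with skin-tone modifiers manifest this behavior too, since the modifier is a separate scalar appended to the base emoji.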

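Thus, I'd be wary of using CharacterSet (which is a "set of Unicode character values for use in search operations"). Or, if you want to provide this method for the sake of convenience, just be aware that it will not work correctly with characters represented by multiple unicode scalars.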
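
To make the multi-scalar problem concrete, here is a small sketch using only the standard library. The flags are written as scalar escapes so the example does not depend on how the emoji render, and a plain `Set<Unicode.Scalar>` stands in for a scalar-based set (it is an illustration, not `CharacterSet` itself).

```swift
// Each flag is a single Character made of two regional-indicator scalars.
let japan = "\u{1F1EF}\u{1F1F5}"    // regional indicators J + P
let jamaica = "\u{1F1EF}\u{1F1F2}"  // regional indicators J + M

print(japan.count)                  // 1
print(japan.unicodeScalars.count)   // 2

// Stand-in for a scalar-based set: it only knows about individual scalars.
let japanScalars = Set(japan.unicodeScalars)
let sharedScalars = jamaica.unicodeScalars.filter { japanScalars.contains($0) }

print(sharedScalars.count)          // 1 (the shared "J" indicator)
print(japan == jamaica)             // false
```

The two flags differ as characters, yet they share a scalar, which is why a per-scalar membership test can be fooled.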

So, you might offer a scanner that provides both CharacterSet and String renditions of the skip method:

class MyScanner {
    let string: String
    var index: String.Index

    init(_ string: String) {
        self.string = string
        index = string.startIndex
    }

    var remains: String { return String(string[index...]) }

    /// Skip characters in a string
    ///
    /// This rendition is safe to use with strings that have characters
    /// represented by more than one unicode scalar.
    ///
    /// - Parameter skipString: A string with all of the characters to skip.

    func skip(charactersIn skipString: String) {
        while index < string.endIndex, skipString.contains(string[index]) {
            index = string.index(index, offsetBy: 1)
        }
    }

    /// Skip characters in character set
    ///
    /// Note, character sets cannot (yet) include characters that are represented by
    /// more than one unicode scalar (e.g. the family emoji, flags, or emoji with
    /// skin-tone modifiers). If you want to test for these multi-unicode characters,
    /// you have to use the `String` rendition of this method.
    ///
    /// This will simply stop scanning if it encounters a multi-unicode character in
    /// the string being scanned (because it knows the `CharacterSet` can only represent
    /// single-unicode characters) and you want to avoid false positives (e.g., mistaking
    /// the Jamaican flag, 🇯🇲, for the Japanese flag, 🇯🇵).
    ///
    /// - Parameter characterSet: The character set to check for membership.

    func skip(charactersIn characterSet: CharacterSet) {
        while index < string.endIndex,
            string[index].unicodeScalars.count == 1,
            let character = string[index].unicodeScalars.first,
            characterSet.contains(character) {
                index = string.index(index, offsetBy: 1)
        }
    }

}

Thus, your simple example will still work:

let scanner = MyScanner("fizz   buzz fizz")
scanner.skip(charactersIn: CharacterSet.alphanumerics)
scanner.skip(charactersIn: CharacterSet.whitespaces)
print(scanner.remains)  // "buzz fizz"

But use the String rendition if the characters you want to skip might include multiple unicode scalars:

let family = "👩\u{200D}👩\u{200D}👧\u{200D}👦"  // 👩‍👩‍👧‍👦
let boy = "👦"

let charactersToSkip = family + boy

let string = boy + family + "foobar"  // 👦👩‍👩‍👧‍👦foobar

let scanner = MyScanner(string)
scanner.skip(charactersIn: charactersToSkip)
print(scanner.remains)                // foobar

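As Michael Waterfall noted in the comments below, CharacterSet has a bug and does not handle 32-bit Unicode.Scalar values correctly, meaning that it does not even handle single-scalar characters properly if the value exceeds 0xffff (including emoji, amongst others). The String rendition, above, handles these correctly, though.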

Swift 4.2 CharacterSet extension function to check whether it contains a Character:

extension CharacterSet {
    func containsUnicodeScalars(of character: Character) -> Bool {
        return character.unicodeScalars.allSatisfy(contains(_:))
    }
}

Example usage:

CharacterSet.decimalDigits.containsUnicodeScalars(of: "3") // true
CharacterSet.decimalDigits.containsUnicodeScalars(of: "a") // false
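
For what it's worth, the same per-scalar check could be dropped into the scanner from the question. The following is a sketch under a few assumptions: `ScalarCheckingScanner` is a made-up name, it needs Swift 4.2 or later for `allSatisfy`, and it treats a character as a member only when every one of its scalars is in the set.

```swift
import Foundation

// Hypothetical scanner name, used for illustration only.
final class ScalarCheckingScanner {
    let string: String
    var index: String.Index

    init(_ string: String) {
        self.string = string
        index = string.startIndex
    }

    var remains: String { return String(string[index...]) }

    // Advances past every character whose scalars are all members of `set`.
    func skip(charactersIn set: CharacterSet) {
        while index < string.endIndex,
              string[index].unicodeScalars.allSatisfy({ set.contains($0) }) {
            index = string.index(after: index)
        }
    }
}

let scanner = ScalarCheckingScanner("fizz   buzz fizz")
scanner.skip(charactersIn: .alphanumerics)
scanner.skip(charactersIn: .whitespaces)
print(scanner.remains)  // buzz fizz
```

Keep in mind this is still a scalar-level test, so it does not tell two different characters apart when all of their scalars happen to be in the set.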