Split a UTF-16 string with special characters using a delimiter

I want to split this UTF-16 string in Swift 5:

ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа

Delimiter: "¾"

I tried the following code:

let Arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".split { $0 == "¾" }.map(String.init)

let Arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".components(separatedBy: "¾")

But both attempts failed.

I wrote an extension! It does not have the side effect of changing Ѝ to И.

import Foundation

let delimiter: Character = "¾" // the delimiter
let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"

let arr = string.components(separatedBySpecialCharacter: delimiter)
print(arr) // ["ddd", "ͰͿΔδόϡϫЍа"]

extension String {
    func components(separatedBySpecialCharacter delimiter: Character) -> [String] {
        // Remove all accents and diacritics so the bare delimiter can be found.
        let cleanedString = self.folding(options: .diacriticInsensitive, locale: .current)

        // Character offsets of the delimiter within the cleaned string.
        let indicesOfDelimiter = cleanedString.indicesOf(string: String(delimiter))

        // Split the original string into an array of Characters.
        // This assumes folding does not change the number of Characters.
        var stringCharacters = Array(self)
        for index in indicesOfDelimiter {
            // Replace each accented delimiter with a clean delimiter.
            stringCharacters[index] = delimiter
        }

        // Rebuild the string, now with clean delimiters, and split on the delimiter.
        let delimiterCleanedString = String(stringCharacters)
        return delimiterCleanedString.components(separatedBy: String(delimiter))
    }
    
    /// get indices of a String inside a String
    /// from 
    func indicesOf(string: String) -> [Int] {
        var indices = [Int]()
        var searchStartIndex = self.startIndex
        
        while searchStartIndex < self.endIndex,
            let range = self.range(of: string, range: searchStartIndex..<self.endIndex),
            !range.isEmpty
        {
            let index = distance(from: self.startIndex, to: range.lowerBound)
            indices.append(index)
            searchStartIndex = range.upperBound
        }
        
        return indices
    }
}

Old answer:

The "¾̷̱̲͈͌͠" inside "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа" carries a lot of diacritics (zalgo text). You can clean it up first like this:

let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"
let cleanedString = string.folding(options: .diacriticInsensitive, locale: .current)
print(cleanedString)

Result:

ddd¾ͰͿΔδοϡϫИа

Now you can use components(separatedBy: "¾") on the cleaned string:

let arr = cleanedString.components(separatedBy: "¾")
print(arr)

Result:

["ddd", "ͰͿΔδοϡϫИа"]

Note that this also changes Ѝ to И. I'll see if there is a better solution.

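The Element of a String is Character. A Character is an extended grapheme cluster, which means it includes all of its combining characters. The character in this string is ¾̷̱̲͈́͌͠, so when you try to split on ¾, it is not found.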
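A small check can make this concrete. The sketch below uses a shortened sample string (an assumption for illustration, not the string from the question), written with `\u{...}` escapes so the individual code points are visible. It compares what the Character view and the Unicode scalar view see.

```swift
// "ddd", then "¾" (U+00BE) followed by two combining marks
// (U+0337, U+0331), then "abc". Sample chosen for illustration only.
let sample = "ddd\u{00BE}\u{0337}\u{0331}abc"

let delimiter: Character = "\u{00BE}"
let delimiterScalar: Unicode.Scalar = "\u{00BE}"

// The Character view groups "¾" and its marks into one grapheme cluster.
let characterCount = sample.count
// The scalar view exposes every code point separately.
let scalarCount = sample.unicodeScalars.count

// The bare "¾" never appears as a Character of its own...
let foundAsCharacter = sample.contains(delimiter)
// ...but it does appear as a scalar.
let foundAsScalar = sample.unicodeScalars.contains(delimiterScalar)

// Splitting on the Character therefore leaves the string in one piece.
let pieces = sample.split(separator: delimiter)

print(characterCount, scalarCount, foundAsCharacter, foundAsScalar, pieces.count)
```

Here the sample has 7 Characters but 9 scalars, which is why a Character-level search for the bare delimiter comes up empty.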

I believe what you want to operate on are UnicodeScalars, which are the individual code points. To do that, call .unicodeScalars first:

let arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".unicodeScalars.split(separator: "¾").map(String.init)
// ["ddd", "̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"]
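The second component above still starts with the combining marks that were attached to the delimiter. If those are unwanted, one possible follow-up is to drop leading mark scalars from each piece. This is only a sketch: `splitOnScalar` is a hypothetical helper name, not something from the answers here, and it assumes that any leading scalar in a mark category belonged to the delimiter. The sample input is written with `\u{...}` escapes and shortened for illustration.

```swift
/// Hypothetical helper (not from the original answers): splits on a single
/// Unicode scalar and strips combining marks left at the start of each piece.
func splitOnScalar(_ text: String, delimiter: Unicode.Scalar) -> [String] {
    return text.unicodeScalars
        .split(separator: delimiter)
        .map { piece -> String in
            // Assumption: leading marks were decorations on the delimiter.
            let trimmed = piece.drop(while: { scalar in
                switch scalar.properties.generalCategory {
                case .nonspacingMark, .spacingMark, .enclosingMark:
                    return true
                default:
                    return false
                }
            })
            return String(trimmed)
        }
}

// "ddd", "¾" + two combining marks, then Ͱ (U+0370), Ѝ (U+040D), а (U+0430).
let input = "ddd\u{00BE}\u{0337}\u{0331}\u{0370}\u{040D}\u{0430}"
let parts = splitOnScalar(input, delimiter: "\u{00BE}")
print(parts)
```

Because only leading marks are dropped, accents elsewhere in a piece are left untouched, so Ѝ is not folded to И.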

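Note that the string you posted here is UTF-8, not UTF-16. Swift cannot operate on UTF-16 literals directly (you would normally store them as Data or [UInt16] and then convert them to a String). However, I don't think that changes your problem.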
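For completeness, here is a minimal sketch of that conversion, assuming the UTF-16 data is already available as an array of code units. The code unit values are made up for illustration.

```swift
// UTF-16 code units for "d", "¾" (U+00BE), a combining mark (U+0337), "a".
// Illustrative values, not taken from the question.
let codeUnits: [UInt16] = [0x0064, 0x00BE, 0x0337, 0x0061]

// Decode the code units into a native Swift String.
let decoded = String(decoding: codeUnits, as: UTF16.self)

// The utf16 view gives the code units back when UTF-16 output is needed.
let roundTripped = Array(decoded.utf16)

print(decoded.count, roundTripped == codeUnits)
```

Once decoded, the string behaves like any other Swift String, so the grapheme cluster behaviour described above applies in the same way.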