Split UTF-16 string with special characters using a delimiter
I want to split this UTF-16 string in Swift 5:
ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа
Delimiter: "¾"
I tried the following code:
let Arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".split { $0 == "¾" }.map(String.init)
let Arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".components(separatedBy: "¾")
But both failed.
I made an extension! It does not have the side effect of changing Ѝ to И.
import Foundation

let delimiter: Character = "¾" /// the delimiter
let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"
let arr = string.components(separatedBySpecialCharacter: delimiter)
print(arr) /// ["ddd", "ͰͿΔδόϡϫЍа"]

extension String {
    func components(separatedBySpecialCharacter delimiter: Character) -> [String] {
        let cleanedString = self.folding(options: .diacriticInsensitive, locale: .current) /// remove all accents and diacritics
        let indicesOfDelimiter = cleanedString.indicesOf(string: String(delimiter)) /// get the Character offsets where the delimiter occurs in the cleaned String
        var stringCharacters = Array(self) /// split the full String into an array of Characters
        for index in indicesOfDelimiter {
            stringCharacters[index] = delimiter /// replace each accented delimiter with a clean delimiter
        }
        let delimiterCleanedString = String(stringCharacters) /// turn the array, now with cleaned delimiters, back into a String
        let separatedComponents = delimiterCleanedString.components(separatedBy: String(delimiter)) /// finally get the components
        return separatedComponents
    }

    /// get indices of a String inside a String
    /// from
    func indicesOf(string: String) -> [Int] {
        var indices = [Int]()
        var searchStartIndex = self.startIndex
        while searchStartIndex < self.endIndex,
              let range = self.range(of: string, range: searchStartIndex..<self.endIndex),
              !range.isEmpty
        {
            let index = distance(from: self.startIndex, to: range.lowerBound)
            indices.append(index)
            searchStartIndex = range.upperBound
        }
        return indices
    }
}
Old answer:
The "¾̷̱̲͈͌͠" inside "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа" carries a lot of diacritics (zalgo text). You can clean it up first like this:
let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"
let cleanedString = string.folding(options: .diacriticInsensitive, locale: .current)
print(cleanedString)
Result:
ddd¾ͰͿΔδοϡϫИа
Now you can use components(separatedBy: "¾") on the cleaned string.
let arr = cleanedString.components(separatedBy: "¾")
print(arr)
Result:
["ddd", "ͰͿΔδοϡϫИа"]
Note that this also changes Ѝ to И. I'll see if there's a better solution.
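The elements of a String are Characters. A Character is an extended grapheme cluster, which means it includes all of its combining characters. The Character in this string is ¾̷̱̲͈́͌͠, not a bare ¾, so when you try to split on ¾, it isn't found.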
I believe what you want to operate on are UnicodeScalars, which are the individual code points. To do that, you need to call .unicodeScalars first:
let arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".unicodeScalars.split(separator: "¾").map(String.init)
// ["ddd", "̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"]
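The scalar-level split above leaves the delimiter's combining marks at the front of the second piece. If the goal is the clean result from the question, one possible follow-up is to drop leading mark scalars from every piece that follows a delimiter. This is only a sketch under assumptions: `isCombiningMark` and `splitDroppingDelimiterMarks` are hypothetical helper names, and treating the Unicode general categories Mn, Mc and Me as "marks to drop" is a choice made here, not something the answers above prescribe.

```swift
// Hypothetical helper: true for scalars in the Unicode mark categories (Mn, Mc, Me).
func isCombiningMark(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark:
        return true
    default:
        return false
    }
}

// Hypothetical helper: split on a scalar, then strip the marks that were attached to it.
func splitDroppingDelimiterMarks(_ text: String, on delimiter: Unicode.Scalar) -> [String] {
    let pieces = text.unicodeScalars.split(separator: delimiter, omittingEmptySubsequences: false)
    return pieces.enumerated().map { index, piece in
        // Only pieces that follow a delimiter can begin with marks that belonged to it.
        guard index > 0 else { return String(piece) }
        return String(piece.drop(while: { isCombiningMark($0) }))
    }
}

// Same hypothetical sample shape as the question: ¾ followed by two combining marks.
let zalgoSample = "ddd\u{00BE}\u{0337}\u{0301}abc"
let cleanParts = splitDroppingDelimiterMarks(zalgoSample, on: "\u{00BE}")
print(cleanParts)
```

Unlike the folding approach, this leaves every other character untouched, so a letter such as Ѝ would keep its accent.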
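Note that the string you posted here is UTF-8, not UTF-16. Swift can't operate directly on UTF-16 literals (you'd normally store them as Data or [UInt16] and then convert them to a String). However, I don't think that changes your question.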
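If the input really does arrive as UTF-16 code units, the conversion mentioned above can be sketched as follows. The `codeUnits` array is a hypothetical example (the code units for "ddd", ¾ plus U+0301, and "abc"); `String(decoding:as:)` is the standard library initializer used for the conversion.

```swift
// Hypothetical input: UTF-16 code units for "ddd", ¾ + U+0301, "abc".
let codeUnits: [UInt16] = [0x0064, 0x0064, 0x0064, 0x00BE, 0x0301, 0x0061, 0x0062, 0x0063]

// Decode into a String; ill-formed sequences would be replaced with U+FFFD.
let decoded = String(decoding: codeUnits, as: UTF16.self)

// Once decoded, the same scalar-level split applies regardless of the original encoding.
let scalarDelimiter: Unicode.Scalar = "\u{00BE}"
let utf16Parts = decoded.unicodeScalars.split(separator: scalarDelimiter).map { String($0) }
print(utf16Parts.count)
// prints "2"
```

The second piece still starts with the combining mark U+0301, exactly as in the scalar split shown earlier.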