用空格拆分长字符串但不带标点符号

Split long string with spaces but without punctuation

我有一个很长的字符串,我需要用空格分隔,所以我在 ios

中做了这个
let str = """
يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا
"""
let count = str.components(separatedBy: " ").count
        
print(count) // 49

它给出了 49,但在 kotlin 中同样的东西在这里给出了 51

val str = getString(R.string.valueHere)

val count = str.split(" ").count()

Log.d("count is " , count.toString()) // 51

<string name="valueHere">يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا</string>

我需要 android 中的字数为 49;在 android 中,它似乎计算空格中的装饰字符,如何解决此问题并在 Kotlin 中产生相同的结果?

编辑:

fun getColorRange(): Range<Int> { 
    
    val text =  // my long string here
    val all = text.split (" ")
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")
    val lower = text.indexOf(sub)
    val upper = lower + sub.length
    return Range<Int>(lower, upper)
}

如果 arr Kotlin 中的长度不同 sub 将是不同的子字符串

通过记录拆分字符串以查看问题所在:

يَا
أَيُّهَا
الَّذِينَ
آمَنُوا
لَا
تَقْرَبُوا
الصَّلَاةَ
وَأَنْتُمْ
سُكَارَىٰ
حَتَّىٰ
تَعْلَمُوا
مَا
تَقُولُونَ
وَلَا
جُنُبًا
إِلَّا
عَابِرِي
سَبِيلٍ
حَتَّىٰ
تَغْتَسِلُوا
ۚ     >>>>>>>>>>>>>>>>>>>>> Problem here
وَإِنْ
كُنْتُمْ
مَرْضَىٰ
أَوْ
عَلَىٰ
سَفَرٍ
أَوْ
جَاءَ
أَحَدٌ
مِنْكُمْ
مِنَ
الْغَائِطِ
أَوْ
لَامَسْتُمُ
النِّسَاءَ
فَلَمْ
تَجِدُوا
مَاءً
فَتَيَمَّمُوا
صَعِيدًا
طَيِّبًا
فَامْسَحُوا
بِوُجُوهِكُمْ
وَأَيْدِيكُمْ
ۗ    >>>>>>>>>>>>>>>>>>>>> Problem here
إِنَّ
اللَّهَ
كَانَ
عَفُوًّا
غَفُورًا

所以,显然问题出在上部变音符号(或准确地说是标记)上,例如 5 5 因为它们不被认为是有效字符。

我相信 Kotlin 版本比 Swift 版本更准确,因为你需要的是:

在 SPACE 上将此字符串作为分隔符(句号)分隔

Swift倾向于做的是它不识别上面的diacritics/markers,即它认为它们什么都不是,并且在拆分字符串时不计算它们。可能还有另一个 Swift 函数可以检测到,但不确定,因为这不是您问题的一部分。

因为你有几个这样的标记;因此 Kotlin 版本比 Swift 多了两个(即 51 而不是 49)。

所以,问题是:如何在拆分之前从字符串中删除上面的 diacritics/markers?

感谢 this answer 列出了这些类型的标记;在 Kotlin 中,您可以使用 String replace() 方法将它们替换为空:

这里有一个片段来修正你的例子:

var str = getString(R.string.valueHere)
str = str
    .replace("\u0615", "") //ARABIC SMALL HIGH TAH
    .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
    .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
    .replace("\u0618", "") //ARABIC SMALL FATHA
    .replace("\u0619", "") //ARABIC SMALL DAMMA
    .replace("\u061A", "") //ARABIC SMALL KASRA
    .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
    .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
    .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
    .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
    .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
    .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
    .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
    .replace("\u06DD", "") //ARABIC END OF AYAH
    .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
    .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
    .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
    .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
    .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
    .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
    .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
    .replace("\u06E5", "") //ARABIC SMALL WAW
    .replace("\u06E6", "") //ARABIC SMALL YEH
    .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
    .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
    .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
    .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
    .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
    .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
    .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

val split = str.split(" ")

val count = str.split(" ").count {
    it.isNotBlank()
}
Log.d("count is ", "$count")

This is the test verification result 在 Kotlin 编译器上

更新:

I have a long string that I need to color range inside it with a different color inside a textView , so split it with spaces get needed words by lower and upper word index, then join them in one string to color their range inside the long string , the above answer did give 49 but it removed important characters mentioned with replace , so any try to tweak your code to consider this ?

因此,如果您遵循上面的方法,您只需要从拆分字符串中删除空格,为此您可以在用空格替换所有标记后使用 filter{} 缩减

fun getColorRange(input: String, wordFrom: Int, wordTo: Int): Range<Int> {
    val text = input
        .replace("\u0615", "") //ARABIC SMALL HIGH TAH
        .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
        .replace("\u0618", "") //ARABIC SMALL FATHA
        .replace("\u0619", "") //ARABIC SMALL DAMMA
        .replace("\u061A", "") //ARABIC SMALL KASRA
        .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
        .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
        .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
        .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
        .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
        .replace("\u06DD", "") //ARABIC END OF AYAH
        .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
        .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
        .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
        .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
        .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
        .replace("\u06E5", "") //ARABIC SMALL WAW
        .replace("\u06E6", "") //ARABIC SMALL YEH
        .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
        .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
        .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
        .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
        .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
        .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

    val all = text.split(" ").filter { it.isNotBlank() } // Remove the blanks (i.e. the markers)
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")

    Log.d("LOG_TAG", "getColorRange: $sub")
    val range = text.indexOf(sub[0], wordFrom)
    return Range<Int>(range, range + sub.length)
}

示例用法:

getColorRange(str, 18, 22)

// Output:
//  حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ

getColorRange(str, 0, 48) // Should return the entire string as this is the total number of words

// Output:
// يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا

还要注意 range 值有问题,因为 sub 是一个列表,而不是字符串,所以下面是错误的

val range = text.indexOf(sub)

相反,您需要获取 sub 中第一项的索引,并且从 wordFrom 开始,而不是从字符串的开头:

val range = text.indexOf(sub[0], wordFrom)