Single Chinese character determined as length 2 in Java/Scala String

Question

I want to split all the Chinese characters out of a string, but I ran into a strange case with one character:

scala> "𥑮"
res1: String = 𥑮

scala> res1.length
res2: Int = 2

scala> res1.getBytes
res3: Array[Byte] = Array(-16, -91, -111, -82)

scala> res1(0)
res4: Char = ?

scala> res1(1)
res5: Char = ?

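𥑮 is a single character, but Java/Scala treats it as two unknown characters. Chinese characters usually take three bytes in UTF-8, but this one needs four.

As a result, I can't split the string and pick out this single character. Worse, when I use myString.replaceAll("[^\\p{script=Han}]", "") to strip out everything that is not a Chinese character, the second half of the character gets replaced and the result is an invalid string.

Is there any solution? I'm using openjdk-8-jdk on Ubuntu.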

Answer 1

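Perhaps this character is invalid or unsupported in UTF-8 but supported in UTF-16, which would cause some incompatibility between the JVM and the Scala shell. Is your system big-endian or little-endian? You could also try getting the character's Unicode code point and checking whether it is UTF-8 or UTF-16. Also, Chinese has compound characters, like Japanese kanji and furigana, so that may be part of your problem too.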

Answer 2

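I think you want to replace or split the string. You can do that without knowing the string's length, because Java's replace works on char sequences, so you can replace a specific character or sequence of chars in the string. For example:

public class Test {

    public static void main(String[] args) {
        // The problematic character from the question (U+2546E).
        String s = "𥑮";
        // replace() matches the whole char sequence, so both surrogates are replaced together.
        System.out.println(s.replace("𥑮", "k"));
    }
}

If you want to split the string, use StringTokenizer (from java.util). For example:

StringTokenizer st = new StringTokenizer("your sentence or String", "the problematic char/string");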

Answer 3

For the length, you should use:

string.codePointCount(0, string.length());
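To make the difference concrete, here is a minimal Java sketch comparing length() with codePointCount(). The class and method names (CodePointLength, codePointLength) are made up for illustration, and the character is written as its surrogate pair so the source stays ASCII-only.

```java
final class CodePointLength {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+2546E, the character from the question, as a surrogate pair.
        String s = "\uD855\uDC6E";
        System.out.println(s.length());         // 2 (UTF-16 code units)
        System.out.println(codePointLength(s)); // 1 (code points)
    }
}
```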

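For replacement, it is best to avoid char-based regular expressions. You can write a loop that advances with String#offsetByCodePoints() and removes characters manually based on String.codePointAt() and Character.isIdeographic().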
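A loop like that could look roughly as follows. This is only a sketch: IdeographFilter and keepIdeographs are hypothetical names, and it assumes "Chinese character" means "code point for which Character.isIdeographic() returns true".

```java
final class IdeographFilter {
    // Keeps only the code points that Character.isIdeographic accepts.
    static String keepIdeographs(String s) {
        StringBuilder out = new StringBuilder();
        // Advance one code point at a time, so a surrogate pair is never split.
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            int cp = s.codePointAt(i);
            if (Character.isIdeographic(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "a", U+7853, "b", U+2546E (as a surrogate pair), "c"
        String s = "a\u7853b\uD855\uDC6Ec";
        // The result holds two code points: U+7853 and U+2546E.
        System.out.println(keepIdeographs(s).codePointCount(0, keepIdeographs(s).length()));
    }
}
```

If the goal is to match the question's \p{script=Han} more closely, the test could instead be Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN; the two checks do not cover exactly the same set of characters.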

Answer 4

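The Java standard library's Unicode support predates the current standard, so its support for astral (non-BMP) characters is... limited; as you have seen, some APIs treat them as two separate surrogates. If you are doing a lot of string manipulation, you are better off using ICU4J, which, as far as I know, provides regular expressions with full Unicode support.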

Answer 5

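You have run into a surrogate pair. The character is U+2546E which, as you can see, is well above 2^16. In a Java or Scala string it is represented as the sequence 0xD855 0xDC6E.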
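The mapping from the code point to those two code units can be checked with the standard Character API. The sketch below is illustrative only; SurrogateDemo, describe and utf8Length are made-up names. It also shows where the four UTF-8 bytes from the question come from.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    // Formats the UTF-16 code units of a code point, e.g. "0xD855 0xDC6E".
    static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x2546E));                              // 0xD855 0xDC6E
        System.out.println(utf8Length(new String(Character.toChars(0x2546E)))); // 4
    }
}
```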

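If you want a regex library that handles this sort of thing transparently, I happen to know where to find one: TCL regex ported to Java. If you don't want to go that route, you will need to navigate the string using the code point methods of String and Character in Java.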

Answer 6

Based on @Marko's answer, here is an example of splitting the string:

scala> val x = "硓𥑮abc"
x: String = 硓𥑮abc

scala> (0 to x.codePointCount(0, x.length)).map(c => x.offsetByCodePoints(0, c)).sliding(2).map(w => x.substring(w.head, w.last)).toList
res1: List[String] = List(硓, 𥑮, a, b, c)

And to check whether each character is a CJK ideograph:

scala> (0 until x.codePointCount(0, x.length)).map(c => x.offsetByCodePoints(0, c)).map(i => Character.isIdeographic(x.codePointAt(i))).toList
res2: List[Boolean] = List(true, true, false, false, false)
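For anyone doing this from Java rather than Scala, roughly the same two steps can be written with String.codePoints() (available since Java 8). This is a sketch under that assumption, and CodePointSplit, split and ideographFlags are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointSplit {
    // Splits a string into one-code-point strings; surrogate pairs stay together.
    static List<String> split(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    // One flag per code point: true if Character.isIdeographic accepts it.
    static List<Boolean> ideographFlags(String s) {
        return s.codePoints()
                .mapToObj(cp -> Boolean.valueOf(Character.isIdeographic(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // U+7853, U+2546E (as a surrogate pair), then "abc"
        String x = "\u7853\uD855\uDC6Eabc";
        System.out.println(split(x).size());   // 5
        System.out.println(ideographFlags(x)); // [true, true, false, false, false]
    }
}
```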


java

scala

character-encoding