How to identify and filter out Unicode characters in the range 40000–DFFFF from a String in Java

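Is there any method in Java to find out whether a character lies between plane 4 and plane 13 of Unicode? According to https://en.wikipedia.org/wiki/Unicode_block, planes 4 to 13 cover the range 40000–DFFFF.

In the code below I am trying to assign a hex value to a char, but when I convert it back to an int I do not get the same int value. The decimal form of DFFFF is 917503, but on converting the char back to an int I get the decimal value 65535. I am not sure why the value changes when the char is converted back to an int. Can someone give me some ideas? According to the Unicode ranges, 40000–DFFFF is currently unassigned. Is that the reason for this odd behaviour?

The use case I actually want to implement is to filter out of an input string any character that falls in the range 40000–DFFFF.

Is there an open source library that does this out of the box? Any help would be appreciated.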

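    int intHex = 0xDFFFF;           // 917503 in decimal
    char c = (char)intHex;          // narrowing cast: only the low 16 bits are kept
    System.out.println((int)c);     // prints 65535 (0xFFFF), not 917503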

Thanks

Have a look at the documentation of the Character class:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter. The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).

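This means that a single char is not enough, because it only has 16 bits. You need an int to hold larger values. 65535 = 2^16 - 1 is the largest value that fits in an unsigned 16-bit integer type, and it is what remains of 0xDFFFF once the cast to char has discarded everything above the low 16 bits.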
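To see where the 65535 comes from, the cast can be checked directly. The following is only a minimal sketch; the class name CharNarrowingDemo and the helper narrowed are made up for illustration. It contrasts the lossy cast with Character.toChars, which produces the UTF-16 surrogate pair that actually encodes the code point.

```java
class CharNarrowingDemo {
    // Casting an int to char is a narrowing conversion: only the low 16 bits survive.
    static int narrowed(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int codePoint = 0xDFFFF;
        System.out.println(narrowed(codePoint));   // 65535, i.e. 0xFFFF
        System.out.println(codePoint & 0xFFFF);    // the same value via explicit masking

        // Lossless alternative: encode the code point as UTF-16 code units.
        char[] pair = Character.toChars(codePoint);
        System.out.println(pair.length);                          // 2
        System.out.println(Character.isHighSurrogate(pair[0]));   // true
        System.out.println(Character.isLowSurrogate(pair[1]));    // true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == codePoint); // true
    }
}
```

So the value does not change because the range is unassigned; any code point above U+FFFF loses its upper bits in the same way when squeezed into a single char.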

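Welcome to the world of UTF-16, one of the IT industry's major accidents.

Many operating systems and programming languages defined their character type as 16 bits wide at a time when it was already obvious that 16 bits would not be enough to represent all the letters used on Earth. Java is one of them.

Unicode has evolved in the meantime. What is colloquially called a letter is called a code point, and a code point now needs up to 21 bits, so in practice it is stored in a 32-bit value. For compatibility reasons, Java could not change the char type from 16 bits to 32 bits. Instead, char was kept at 16 bits and redefined to be a UTF-16 code unit (rather than a plain UCS-2 character).

In short: a code point like U+DFFFF needs more than 16 bits and cannot be represented by a single char. So switch from char to code points, which Java represents as int: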

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);
   if (codepoint >= 0x40000 && codepoint <= 0xdffff) {
       // do something with the codepoint
   }
   offset += Character.charCount(codepoint);
}
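To make the loop above do the actual filtering, the kept code points can be appended to a StringBuilder. This is a sketch under my own naming (PlaneRangeFilter and removePlanes4To13 are hypothetical, not from any library), and it assumes that "filter out" means dropping the code points silently rather than replacing them.

```java
class PlaneRangeFilter {
    // Returns a copy of the input without any code point in U+40000..U+DFFFF.
    static String removePlanes4To13(String input) {
        StringBuilder kept = new StringBuilder(input.length());
        for (int offset = 0; offset < input.length(); ) {
            int codePoint = input.codePointAt(offset);
            if (codePoint < 0x40000 || codePoint > 0xDFFFF) {
                kept.appendCodePoint(codePoint);   // writes one or two chars as needed
            }
            offset += Character.charCount(codePoint);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // 'a', U+40000 (plane 4), 'b', U+1F600 (plane 1), 'c'
        String sample = "a" + new String(Character.toChars(0x40000))
                + "b" + new String(Character.toChars(0x1F600)) + "c";
        String cleaned = removePlanes4To13(sample);
        System.out.println(cleaned.codePointCount(0, cleaned.length())); // 4
    }
}
```

Supplementary characters outside the range, such as U+1F600 here, survive intact because they are read and written as whole code points rather than as individual chars.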

    int intHex = 0xDFFFF; // This is called a Unicode code point.

    char[] chars = Character.toChars(intHex); // In this case a surrogate pair of two chars.
    String s = new String(chars);
    int[] codePoints = new int[] {intHex};
    String t = new String(codePoints, 0, codePoints.length); // Same content as s, built from code points.

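A char in Java is a UTF-16 code unit. UTF-16 ensures (as UTF-8 does) that the code units encoding one Unicode symbol (code point), here a surrogate pair of chars, cannot be confused with the encoding of any other character.

It is best to work with a String and extract its code points.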

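    // Keep everything outside 0x40000..0xDFFFF, including planes 14 to 16 above the range.
    int[] codePoints = s.codePoints().filter(cp -> cp < 0x40000 || cp > 0xDFFFF).toArray();
    String t = new String(codePoints, 0, codePoints.length);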
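As for the original question of testing the plane directly: every plane holds 0x10000 code points, so the plane number of a valid code point is the code point shifted right by 16 bits. A minimal sketch along those lines follows; PlaneCheck, planeOf, isInPlanes4To13 and stripPlanes4To13 are names I made up, and I am not aware of a dedicated plane method in the JDK.

```java
class PlaneCheck {
    // Plane number (0..16) of a valid code point; each plane spans 0x10000 code points.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    // True for U+40000..U+DFFFF, i.e. planes 4 through 13.
    static boolean isInPlanes4To13(int codePoint) {
        int plane = planeOf(codePoint);
        return plane >= 4 && plane <= 13;
    }

    // Stream variant of the filter that collects straight into a String.
    static String stripPlanes4To13(String input) {
        return input.codePoints()
                .filter(cp -> !isInPlanes4To13(cp))
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x40000));  // 4
        System.out.println(planeOf(0xDFFFF));  // 13
        String sample = "x" + new String(Character.toChars(0x50000)) + "y";
        System.out.println(stripPlanes4To13(sample)); // xy
    }
}
```

Collecting into a StringBuilder avoids the intermediate int[] array, though for short strings either form should be fine.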