Convert UCS-4 to UCS-2

Question

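The UCS-4 character '🤣' has the Unicode value 0001f923. When I paste it into Java code in IntelliJ IDEA, it is automatically changed to the corresponding value \uD83E\uDD23.

Java only supports UCS-2, so a conversion from UCS-4 to UCS-2 takes place.

I would like to know the logic behind the conversion, but I could not find any material on it.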

Answer 1

https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

U+010000 to U+10FFFF

0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.

The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.

The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.

Now take the code point \U1F923 as input:

\U1F923 - \U10000=\UF923
\UF923 = 1111100100100011 = 00001111100100100011 = [0000111110][0100100011] = [\U3E][\U123]
\UD800 + \U3E = \UD83E
\UDC00 + \U123 = \UDD23
Result: \UD83E\UDD23

In code:

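public class SurrogatePair {
    public static void main(String[] args) {
        int input = 0x1f923;
        // Subtract 0x10000 to get the 20-bit value U'.
        int x = input - 0x10000;

        // Split U' into its high and low ten bits.
        int highTenBits = x >> 10;
        int lowTenBits = x & ((1 << 10) - 1);

        // Offset each half into the high and low surrogate ranges.
        int high = highTenBits + 0xd800;
        int low = lowTenBits + 0xdc00;

        // Prints [d83e][dd23].
        System.out.println(String.format("[%x][%x]", high, low));
    }
}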
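
As a cross-check, the manual arithmetic can be compared with what the JDK itself produces. The sketch below assumes the standard `Character.toChars` and `Character.toCodePoint` methods; the class name `SurrogateCheck` and the helper `manualSplit` are made up for illustration.

```java
class SurrogateCheck {
    // Hypothetical helper: the manual split described above.
    // Only valid for code points in the range 0x10000..0x10FFFF.
    static char[] manualSplit(int codePoint) {
        int x = codePoint - 0x10000;
        char high = (char) (0xD800 + (x >> 10));   // high ten bits
        char low = (char) (0xDC00 + (x & 0x3FF));  // low ten bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int input = 0x1F923;

        char[] manual = manualSplit(input);
        char[] builtin = Character.toChars(input); // JDK conversion to UTF-16

        System.out.printf("manual:  [%x][%x]%n", (int) manual[0], (int) manual[1]);
        System.out.printf("builtin: [%x][%x]%n", (int) builtin[0], (int) builtin[1]);

        // The reverse direction: combine a surrogate pair into a code point.
        int back = Character.toCodePoint(builtin[0], builtin[1]);
        System.out.printf("back:    %x%n", back);
    }
}
```

Both lines should show the same pair, and `toCodePoint` should return the original value, which suggests that what IntelliJ IDEA inserts is simply the UTF-16 surrogate pair for the code point.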

Answer 2

Although a String holds Unicode as an array of char, where each char is a two-byte UTF-16BE code unit, UCS-4 is supported as well.

UCS4: UTF-32, "code points":

A Unicode code point (UCS-4) is represented in Java as an int.

int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
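
To make the difference between char units and code points visible, here is a small runnable sketch built around the same `String(int[], int, int)` constructor. The class name `CodePointDemo` and the helper `fromCodePoints` are illustrative names, not part of the answer.

```java
class CodePointDemo {
    // Hypothetical helper wrapping the String(int[], int, int) constructor.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x1F923);

        // length() counts UTF-16 code units, not characters: prints 2
        System.out.println(s.length());
        // codePointCount() counts code points: prints 1
        System.out.println(s.codePointCount(0, s.length()));

        // The two chars are the surrogate pair: prints d83e dd23
        System.out.printf("%x %x%n", (int) s.charAt(0), (int) s.charAt(1));
        // codePointAt() reassembles the pair: prints 1f923
        System.out.printf("%x%n", s.codePointAt(0));
    }
}
```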

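Encoding code points as UTF-16 or UTF-8 requires longer sequences of 2-byte or 1-byte values, respectively. The encodings are designed so that the 2-byte or 1-byte values in such a sequence differ from any value that stands on its own. As a result, such a value can never falsely match "/" or any other string being searched for. This is achieved by starting each value with fixed high bits (1...), followed by the bits of the code point in big-endian order (most significant first).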
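
The following sketch illustrates this property for U+1F923. It assumes only standard JDK calls (`String.getBytes`, `Character.isSurrogate`); the class `SelfSyncDemo` and its helpers are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

class SelfSyncDemo {
    // Hypothetical helper: UTF-8 bytes of a single code point.
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // True if every byte has its high bit set, i.e. none can be an ASCII byte like '/'.
    static boolean allHighBitSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bytes = utf8(0x1F923);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        // Four bytes: a lead byte 11110xxx followed by three 10xxxxxx bytes.
        System.out.println(hex.toString().trim());
        System.out.println(allHighBitSet(bytes));

        // In UTF-16, both units fall in the reserved surrogate range,
        // so neither can be mistaken for a standalone character.
        for (char c : Character.toChars(0x1F923)) {
            System.out.printf("%x surrogate=%b%n", (int) c, Character.isSurrogate(c));
        }
    }
}
```

For this code point the UTF-8 bytes are f0 9f a4 a3, none of which collides with an ASCII byte.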

Searching for "UTF-16", rather than for "UCS4" and "UCS2", will turn up information on the algorithm used.

