Convert UCS-4 to UCS-2
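The UCS-4 character '🤣' has the Unicode value 0001F923. When I paste it into Java code in IntelliJ IDEA, it is automatically changed to the corresponding value \uD83E\uDD23.
Java only supports UCS-2, so a conversion from UCS-4 to UCS-2 takes place.
I would like to know the logic of that conversion, but I could not find any material on it.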
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
U+010000 to U+10FFFF
- 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
- The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
- The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
Now take the code point \U1F923 as input:
- \U1F923 - \U10000 = \UF923
- \UF923 = 1111100100100011 = 00001111100100100011 = [0000111110][0100100011] = [\U3E][\U123]
- \UD800 + \U3E = \UD83E
- \UDC00 + \U123 = \UDD23
- Result: \UD83E\UDD23
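In code:
public class Main {
    public static void main(String[] args) {
        int input = 0x1f923;
        // Step 1: subtract 0x10000, leaving a 20-bit value.
        int x = input - 0x10000;
        // Step 2: split into the high and low ten bits.
        int highTenBits = x >> 10;
        int lowTenBits = x & ((1 << 10) - 1);
        // Step 3: add the surrogate bases.
        int high = highTenBits + 0xd800;
        int low = lowTenBits + 0xdc00;
        System.out.printf("[%x][%x]%n", high, low); // prints [d83e][dd23]
    }
}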
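As a cross-check on the manual steps, the result can be compared with what the JDK itself produces. The sketch below assumes only the standard `Character.toChars` method. The class name `SurrogateCheck` and the helpers `toSurrogates` and `fromSurrogates` are made up for illustration. It also shows the reverse direction, which the steps above imply but do not spell out.

```java
class SurrogateCheck {
    // Manual split, following the steps quoted from the UTF-16 article.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int x = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (x >> 10));   // high ten bits
        char low = (char) (0xDC00 + (x & 0x3FF));  // low ten bits
        return new char[] {high, low};
    }

    // Reverse direction: rebuild the code point from a surrogate pair.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int input = 0x1F923;
        char[] manual = toSurrogates(input);
        char[] library = Character.toChars(input); // JDK's own conversion
        System.out.printf("manual:  [%x][%x]%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: [%x][%x]%n", (int) library[0], (int) library[1]);
        System.out.printf("decoded: %x%n", fromSurrogates(manual[0], manual[1]));
    }
}
```

Both lines should show the same pair, and decoding the pair should give back the original code point.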
Although String holds Unicode as an array of char, where each char is a two-byte UTF-16BE code unit, it also supports UCS4.
UCS4: UTF-32, "code points":
A Unicode code point (UCS4) is represented in Java as an int.
int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
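To make the difference between char units and code points visible, here is a small sketch built on the snippet above. It uses only standard `String` methods (`length`, `codePointCount`, `charAt`, `codePointAt`). The class name `CodePointDemo` and the helper `fromCodePoint` are hypothetical names chosen for this example.

```java
class CodePointDemo {
    // Build a String from a single code point, as in the snippet above.
    static String fromCodePoint(int codePoint) {
        int[] ucs4 = {codePoint};
        return new String(ucs4, 0, ucs4.length);
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F923);
        // length() counts 16-bit char units, not code points.
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        // The two char units are the surrogate pair.
        System.out.printf("units:       %x %x%n", (int) s.charAt(0), (int) s.charAt(1));
        // codePointAt reassembles the pair into the original value.
        System.out.printf("code point:  %x%n", s.codePointAt(0));
    }
}
```

For a code point above U+FFFF the string holds two char units but only one code point, which is why APIs that work on int code points are needed alongside the char-based ones.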
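Encoding a code point in UTF-16 or UTF-8 can require a longer sequence of 2-byte or 1-byte values, respectively.
The encodings are chosen so that the 2-byte or 1-byte values in such a sequence differ from every other value. This means those values can never falsely match "/" or any other string being searched for. This is achieved by having the high bits start with 1..., followed by the bits of the code point in big-endian order (most significant first).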
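The point about the high bits can be checked directly. The sketch below encodes U+1F923 as UTF-8 with the standard `String.getBytes(StandardCharsets.UTF_8)` call and tests that every byte has its top bit set, so none of them can be mistaken for an ASCII character such as "/". The class name `HighBitDemo` and its two helpers are hypothetical names for this example only.

```java
import java.nio.charset.StandardCharsets;

class HighBitDemo {
    // True if every byte has its top bit set, i.e. none is in the ASCII range.
    static boolean allHighBitsSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    // True if the char lies in the reserved surrogate range 0xD800-0xDFFF.
    static boolean isSurrogateUnit(char c) {
        return c >= 0xD800 && c <= 0xDFFF;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F923));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println("UTF-8 bytes: " + hex.toString().trim());
        System.out.println("all high bits set: " + allHighBitsSet(utf8));
        System.out.println("both units are surrogates: "
                + (isSurrogateUnit(s.charAt(0)) && isSurrogateUnit(s.charAt(1))));
        System.out.println("contains '/': " + s.contains("/"));
    }
}
```

The same idea holds for UTF-16: both units of a pair fall in the range 0xD800–0xDFFF, which is reserved for surrogates, so a search for an ordinary character cannot match half of a pair.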
Searching for UTF-16, rather than for UCS4 and UCS2, will turn up information on the algorithm used.