Wrong bytes from UTF-16 encoding
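I have the character '😭', whose Unicode code point is U+1F62D, or 11111011000101101 in binary. I want to convert this character to a byte array. My steps:
1) The binary representation does not fit in 2 bytes, so I use 4 bytes:
XXXXXXXX XXXXXXX1 11110110 00101101
2) I replace every 'X' with '0':
00000000 00000001 11110110 00101101
3) The decimal equivalents, as signed bytes:
00000000(0) 00000001(1) 11110110(-10) 00101101(45)
Here is my code: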
    @Test
    public void testUtf16With4Bytes() throws Exception {
        assertThat(
            new String(
                new byte[]{0, 1, -10, 45},
                StandardCharsets.UTF_16BE
            ),
            is("😭")
        );
    }
Here is the output:
java.lang.AssertionError:
Expected: is "😭"
but: was ""
What am I missing?
You missed that UTF-16 stores some characters as surrogate pairs:
In UTF-16, characters in ranges U+0000 to U+D7FF and U+E000 to U+FFFD are stored as a single 16-bit unit. Non-BMP characters (range U+10000 to U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800 to U+DBFF) followed by a low surrogate (in range U+DC00 to U+DFFF). A lone surrogate character is invalid in UTF-16; surrogate characters are always written as pairs (high followed by low).
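The split into a high and a low surrogate can be worked out by hand: subtract 0x10000 from the code point, then add the top 10 bits of the result to 0xD800 and the bottom 10 bits to 0xDC00. The following is only a minimal sketch of that arithmetic; the class name SurrogatePairSketch and the helpers toSurrogates and toUtf16Be are hypothetical names made up for illustration, not part of any library.

```java
import java.util.Arrays;

// Hypothetical illustration class; the names are not from the original post.
public class SurrogatePairSketch {

    /** Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair. */
    public static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    /** Writes the surrogate pair as big-endian bytes (UTF-16BE, no byte order mark). */
    public static byte[] toUtf16Be(int codePoint) {
        char[] pair = toSurrogates(codePoint);
        return new byte[] {
            (byte) (pair[0] >> 8), (byte) pair[0],
            (byte) (pair[1] >> 8), (byte) pair[1]
        };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F62D);
        // prints "U+D83D U+DE2D"
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // prints "[-40, 61, -34, 45]"
        System.out.println(Arrays.toString(toUtf16Be(0x1F62D)));
    }
}
```

In real code there is no need to do this by hand, since Character.toChars and String.getBytes perform the same conversion; the sketch only makes the bit manipulation visible.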
Your character is U+1F62D, so it falls in the U+10000 to U+10FFFF range. It is represented by the surrogate pair U+D83D U+DE2D, which as a byte[] is [-40, 61, -34, 45].
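To check this without computing anything manually, you can let the JDK do the encoding and compare the two byte arrays. This is a sketch only; the class name Utf16RoundTrip and its helper methods are hypothetical, while Character.toChars, String.getBytes, and StandardCharsets.UTF_16BE are standard JDK APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; the names are not from the original post.
public class Utf16RoundTrip {

    /** Builds a String from a single code point, using a surrogate pair if needed. */
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Encodes a single code point as UTF-16BE bytes (no byte order mark). */
    public static byte[] encodeUtf16Be(int codePoint) {
        return fromCodePoint(codePoint).getBytes(StandardCharsets.UTF_16BE);
    }

    /** Decodes bytes as UTF-16BE. */
    public static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The bytes the JDK produces for U+1F62D.
        System.out.println(Arrays.toString(encodeUtf16Be(0x1F62D)));

        // The bytes from the question decode to two unrelated BMP characters.
        String wrong = decodeUtf16Be(new byte[] {0, 1, -10, 45});
        System.out.printf("U+%04X U+%04X%n", (int) wrong.charAt(0), (int) wrong.charAt(1));

        // The surrogate-pair bytes decode back to the intended code point.
        String right = decodeUtf16Be(new byte[] {-40, 61, -34, 45});
        System.out.printf("U+%X%n", right.codePointAt(0));
    }
}
```

The bytes {0, 1, -10, 45} are read by UTF-16BE as the two separate 16-bit units 0x0001 and 0xF62D, which is why the assertion in the question fails.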