Java UTF-16 String always uses 4 bytes instead of 2 bytes
I have a simple test:
@Test
public void utf16SizeTest() throws Exception {
    final String test = "п";
    // 'п' = U+043F according to the Unicode table
    // 0x43F in binary = 100 0011 1111 (11 significant bits)
    // pad with leading zeros so the length is 16 bits:
    // 0000 0100 0011 1111
    // high byte 00000100 (binary), low byte 00111111 (binary)
    // = 4 and 63 in decimal
    final byte[] bytes = test.getBytes("UTF-16");
    for (byte aByte : bytes) {
        System.out.println(aByte);
    }
}
As you can see, I first convert 'п' to binary and then pad it with leading zero bits until the length is 16.
I expected the output to be 4, 63.
But the actual output is:
-2
-1
4
63
What am I doing wrong?
If you try:
final String test = "ппп";
you will find that -2 -1 appears only at the beginning:
-2
-1
4
63
4
63
4
63
-2 is 0xFE and -1 is 0xFF (Java bytes are signed). Together they form the BOM (byte order mark):
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a
file or character stream to indicate the endianness (byte order) of
all the 16-bit code units of the file or stream. If an attempt is made
to read this stream with the wrong endianness, the bytes will be
swapped, thus delivering the character U+FFFE, which is defined by
Unicode as a "non character" that should never appear in the text.
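To make the link between the printed values -2, -1 and the bytes 0xFE, 0xFF visible, here is a minimal sketch that prints the encoded bytes as unsigned hex. The class name BomHexDemo and the helper toHex are made up for this illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class BomHexDemo {
    // Hypothetical helper: renders each byte as an unsigned two-digit hex value.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF widens the signed byte to an int in the range 0..255,
            // so -2 becomes 0xFE and -1 becomes 0xFF.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u043F" is 'п'; the escape avoids depending on the source file encoding.
        byte[] bytes = "\u043F".getBytes(StandardCharsets.UTF_16);
        System.out.println(toHex(bytes)); // FE FF 04 3F
    }
}
```

The first two bytes are the BOM U+FEFF in big-endian order, and the last two are the single code unit U+043F.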
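test.getBytes("UTF-16"); uses big-endian byte order by default when encoding, and therefore prepends a BOM so that later consumers can tell that big-endian was used.
You can specify the byte order explicitly with UTF-16LE or UTF-16BE, which avoids the BOM in the output: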
final byte[] bytes = test.getBytes("UTF-16BE");
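As a side-by-side comparison, the following sketch encodes the same one-character string with all three charsets. The class name Utf16VariantsDemo and the helper encoded are assumptions made for the example; it uses the StandardCharsets constants rather than charset names, which also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16VariantsDemo {
    // Hypothetical helper: encodes text and returns the signed byte values as a string.
    static String encoded(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String test = "\u043F"; // 'п'
        // UTF-16: big-endian with a leading BOM
        System.out.println(encoded(test, StandardCharsets.UTF_16));   // [-2, -1, 4, 63]
        // UTF-16BE: big-endian, no BOM
        System.out.println(encoded(test, StandardCharsets.UTF_16BE)); // [4, 63]
        // UTF-16LE: little-endian, no BOM
        System.out.println(encoded(test, StandardCharsets.UTF_16LE)); // [63, 4]
    }
}
```

So the character itself always takes 2 bytes; the extra 2 bytes in the plain UTF-16 output are the BOM, written once per encoded string.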
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
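The decoding rules quoted above can be checked with a small sketch like the one below. The class name Utf16DecodeDemo and the helper decode are invented for this example; the byte arrays are the outputs shown earlier in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16DecodeDemo {
    // Hypothetical helper: decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x04, 0x3F};
        byte[] withoutBom = {0x04, 0x3F};

        // UTF-16 consumes the BOM and uses it to pick the byte order.
        String a = decode(withBom, StandardCharsets.UTF_16);
        // UTF-16BE does not strip the BOM; it stays in the string as U+FEFF.
        String b = decode(withBom, StandardCharsets.UTF_16BE);
        // Without a BOM, UTF-16 falls back to big-endian.
        String c = decode(withoutBom, StandardCharsets.UTF_16);

        System.out.println(a.length());              // 1
        System.out.println(b.length());              // 2
        System.out.println(b.charAt(0) == 0xFEFF);   // true
        System.out.println(a.equals(c));             // true
    }
}
```

In practice this means bytes produced with "UTF-16" should be decoded with "UTF-16" as well; decoding them with UTF-16BE leaves an invisible U+FEFF at the start of the string.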