When encoding a string as UTF-8 with a 3-byte character, why does the length only increase by 1?

I'm trying to make sure I understand encodings correctly, so I wrote a small test:

import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class TestEncoding {

    public static void main(String[] args) throws UnsupportedEncodingException {
        TestEncoding testEncoding = new TestEncoding();
        testEncoding.isLengthDifferenceBetweenUTF16UTF32();
    }

    private void isLengthDifferenceBetweenUTF16UTF32() throws UnsupportedEncodingException {
        String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
        String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
        String thirtyTwoBitString= new String("Hi how are you?ࢤ".getBytes("UTF-32"));

        System.err.println("8 bit: " + eightBitString.length());
        System.err.println("16 bit: " + sixteenBitString.length());
        System.err.println("32 bit: " + thirtyTwoBitString.length());
    }
}

The output I get is:

8 bit: 16
16 bit: 32
32 bit: 64

My question is: why is the 8-bit result not 18, that is, 15 for "Hi how are you?" plus 3 for the special character ࢤ at the end?

String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
String thirtyTwoBitString= new String("Hi how are you?ࢤ".getBytes("UTF-32"));

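These lines take a string, convert it to bytes in the specified charset, and then convert those bytes back to a string using the JVM's default charset.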

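What you get depends on the default, but the byte sequence may not be valid in the default charset. In that case the string will contain a placeholder character for each invalid sequence. It shows up in the resulting string as �, which is a single character.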

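For example, if the default charset is UTF-8, the 3 strings are as follows (the zero bytes in the UTF-16 and UTF-32 encodings decode to NUL characters, which count toward the length but are invisible when printed):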

Hi how are you?ࢤ
Hi how are you?�
Hi how are you?�
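
To reproduce this regardless of what your JVM default happens to be, here is a minimal sketch that names the decoding charset explicitly. The class and method names (RoundTripDemo, decodeAsUtf8, countReplacements) are made up for illustration, and the special character is written as the escape \u08A4, which I'm assuming is the character from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Encode the text with the given charset, then decode those bytes as UTF-8.
    // Naming the decoding charset makes the result independent of the JVM default.
    static String decodeAsUtf8(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Count the U+FFFD replacement characters the decoder inserted.
    static long countReplacements(String s) {
        return s.chars().filter(c -> c == '\uFFFD').count();
    }

    public static void main(String[] args) {
        // Assumption: the special character in the question is U+08A4.
        String text = "Hi how are you?\u08A4";
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16LE,
            Charset.forName("UTF-32")
        };
        for (Charset cs : charsets) {
            String decoded = decodeAsUtf8(text, cs);
            System.out.println(cs + ": length " + decoded.length()
                    + ", replacement characters " + countReplacements(decoded));
        }
    }
}
```

Only the UTF-8 round trip gives back the original 16-character string. In the other two cases length() is counting the NUL characters and the placeholder, not anything meaningful about the original text.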

If you want to compare the lengths of the byte representations in these charsets, don't convert back to a string:

byte[] eightBit = "Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8);
System.out.println(eightBit.length);

and so on.
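
Spelled out for all three charsets, a sketch could look like the following. ByteLengthDemo and byteLength are names I made up, and the special character is again assumed to be U+08A4. Keep in mind that String.length() counts UTF-16 code units (Java chars), not encoded bytes, which is why it never gives 18 for this string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {

    // Number of bytes the text occupies when encoded with the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the special character in the question is U+08A4.
        String text = "Hi how are you?\u08A4";

        // length() counts chars (UTF-16 code units), independent of any encoding.
        System.out.println("chars: " + text.length());

        System.out.println("UTF-8 bytes: " + byteLength(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE bytes: " + byteLength(text, StandardCharsets.UTF_16LE));
        System.out.println("UTF-32 bytes: " + byteLength(text, Charset.forName("UTF-32")));
    }
}
```

Here the UTF-8 byte count is the 18 you were expecting: 15 one-byte ASCII characters plus 3 bytes for the last character.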