String byte encoding issue

Question

Given that I have the following function:

static void fun(String str) {
    System.out.println(String.format("%s | length in String: %d | length in bytes: %d | bytes: %s", str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}

When I call fun("ó"); its output is:

ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]

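So this means the character ó needs 2 bytes to be represented, and according to the Character class documentation, Java also uses UTF-16 by default. With that in mind, when I run the following: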

System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=Ã³
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃

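Why can't the UTF_16, UTF_16BE, and UTF_16LE charsets decode the bytes correctly, given that the bytes represent a character 16 bits long? And how is UTF-8 able to decode them correctly? Since UTF-8 considers each character to be only 8 bits long, it should have printed 2 characters (1 character per byte), as ISO_8859_1 does.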

Answer 1

getBytes (with no arguments) always returns the bytes encoded in the platform's default charset, which is probably UTF-8 for you.

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So you are effectively trying to decode a bunch of UTF-8 bytes with non-UTF-8 charsets. No wonder you don't get the expected results.
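To make the mismatch concrete, here is a minimal sketch that encodes ó as UTF-8 explicitly (so it does not depend on the platform default) and then prints the code points each charset produces from those same two bytes. The class name MisdecodeDemo and the helper decodeToCodePoints are hypothetical names chosen for illustration, and the string is written as "\u00F3" to avoid depending on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Decodes the bytes with the given charset and lists the resulting code points as U+XXXX.
    static String decodeToCodePoints(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder sb = new StringBuilder();
        decoded.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "\u00F3" is ó; its UTF-8 encoding is the two bytes 0xC3 0xB3 (-61, -77 as signed bytes).
        byte[] utf8 = "\u00F3".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeToCodePoints(utf8, StandardCharsets.UTF_8));      // U+00F3
        System.out.println(decodeToCodePoints(utf8, StandardCharsets.UTF_16BE));   // U+C3B3
        System.out.println(decodeToCodePoints(utf8, StandardCharsets.UTF_16LE));   // U+B3C3
        System.out.println(decodeToCodePoints(utf8, StandardCharsets.ISO_8859_1)); // U+00C3 U+00B3
        System.out.println(decodeToCodePoints(utf8, StandardCharsets.US_ASCII));   // U+FFFD U+FFFD
    }
}
```

The UTF-16 variants simply glue the two bytes together into one 16-bit code unit (0xC3B3 or 0xB3C3), which happens to be a Hangul syllable, while ISO-8859-1 maps each byte to its own character and US-ASCII rejects both bytes as out of range.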

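Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that the string is encoded with the same charset you later decode it with.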

    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
    System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII)); // prints "?": US-ASCII cannot encode ó
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));
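As a rough illustration of why the charsets are not interchangeable, the sketch below prints the raw bytes each charset produces for the same character. EncodedBytesDemo and hex are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedBytesDemo {
    // Encodes the string with the given charset and renders the bytes as space-separated hex.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00F3"; // ó
        System.out.println(hex(s, StandardCharsets.UTF_8));      // C3 B3
        System.out.println(hex(s, StandardCharsets.ISO_8859_1)); // F3
        System.out.println(hex(s, StandardCharsets.US_ASCII));   // 3F  (the replacement '?')
        System.out.println(hex(s, StandardCharsets.UTF_16BE));   // 00 F3
        System.out.println(hex(s, StandardCharsets.UTF_16LE));   // F3 00
        System.out.println(hex(s, StandardCharsets.UTF_16));     // FE FF 00 F3  (byte-order mark first)
    }
}
```

Every charset yields a different byte sequence for ó, so bytes produced by one can only be decoded reliably by that same charset.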

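You also seem to have some misconceptions about encodings. An encoding is not just about the number of bytes a character takes. Two encodings having the same byte-count-per-character does not mean they are compatible with each other. Also, in UTF-8 a character is not always one byte: UTF-8 is a variable-length encoding that uses between one and four bytes per code point.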
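The variable-length nature of UTF-8 can be seen with a small sketch like the following, which counts the UTF-8 bytes for a few sample characters. Utf8LengthDemo and utf8Length are hypothetical names, and the sample characters are my own choice rather than anything from the question.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Returns how many bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1  (U+0041, ASCII range)
        System.out.println(utf8Length("\u00F3"));       // 2  (U+00F3, ó)
        System.out.println(utf8Length("\u20AC"));       // 3  (U+20AC, euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4  (U+1F600, written as a surrogate pair)
    }
}
```

The leading bits of the first byte tell a UTF-8 decoder how many bytes belong to the character, which is why it reads 0xC3 0xB3 as a single ó instead of two separate characters.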

java

string

character-encoding