Why do input and output differ in the chain byte[] → String → byte[] when using the UTF-8 charset?

Question

The following test fails.

@Test
public void testConversions() {
    final Charset charset = Charsets.UTF_8;
    final byte[] inputBytes = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
    final String string = new String(inputBytes, charset);
    final byte[] outputBytes = string.getBytes(charset);
    assertArrayEquals(inputBytes, outputBytes);
}

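If the ISO_8859_1 charset is used instead of UTF-8, the test passes, even with a much larger inputBytes array. Do the input and output differ because of the 'variable-width' property of UTF-8?

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?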

Answer 1

Do the input and output differ because of 'variable-width' property of UTF-8?

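They differ because, as a consequence of the variable-width encoding, not every byte sequence can occur in a validly UTF-8-encoded string.

You can see this in the table in the Wikipedia article about UTF-8: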

1 byte: 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Here the x's denote bits that can be either 0 or 1; the digits show bits that must be set to that value in a valid encoding.
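To make the table concrete, here is a small sketch that prints the UTF-8 bytes of a few characters in binary. The class name Utf8BitPatterns and the helper toBits are made up for this illustration; the sample characters are arbitrary picks, one per encoded width.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BitPatterns {
    // Renders each byte of the UTF-8 encoding of s as 8 binary digits.
    public static String toBits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad to 8 digits so the fixed prefix bits line up with the table.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("A"));            // U+0041, 1 byte
        System.out.println(toBits("\u00FC"));       // U+00FC, 2 bytes
        System.out.println(toBits("\u20AC"));       // U+20AC, 3 bytes
        System.out.println(toBits("\uD83D\uDE00")); // U+1F600, 4 bytes
    }
}
```

Every lead byte starts with 0, 110, 1110 or 11110, and every following byte starts with 10, which is exactly the structure an arbitrary byte array is not guaranteed to have.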

So you will never find, for example, 11000000 11000000 in a valid UTF-8 string. If you try to build a String from such bytes, the charset decoder has to do ... something. Specifically:

[new String(byte[], Charset)] always replaces malformed-input and unmappable-character sequences with this charset's default replacement string

As a result, the String you construct cannot necessarily be mapped back to the input bytes.
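A minimal sketch of that behaviour, using the tail of the byte array from the question. The class and method names are hypothetical. roundTrip uses the lenient String constructor, while isValidUtf8 uses a CharsetDecoder configured with CodingErrorAction.REPORT, which is one way to detect malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Lenient round trip: malformed input is replaced while decoding,
    // so the result may differ from the input.
    public static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: returns false when the bytes are not well-formed UTF-8.
    public static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "%", four bytes that are not valid UTF-8, then CR LF.
        byte[] input = {37, -11, -28, -10, -4, 13, 10};
        System.out.println("valid UTF-8: " + isValidUtf8(input));
        System.out.println("input:  " + Arrays.toString(input));
        System.out.println("output: " + Arrays.toString(roundTrip(input)));
    }
}
```

The replacement character is U+FFFD, which encodes to the three bytes EF BF BD (-17, -65, -67 as signed Java bytes), so the output is longer than the input and the original byte values are gone.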

Bonus question

Yes, because it is a fixed-width encoding in which every possible byte value has a corresponding character.
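This can be checked exhaustively, since there are only 256 byte values. The sketch below (class and method names are made up for the example) round-trips an array containing every byte value once.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Builds an array holding every possible byte value exactly once.
    public static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True when byte[] -> String -> byte[] reproduces the input exactly.
    public static boolean roundTrips(byte[] input, Charset charset) {
        byte[] output = new String(input, charset).getBytes(charset);
        return Arrays.equals(input, output);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1: " + roundTrips(all, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + roundTrips(all, StandardCharsets.UTF_8));
    }
}
```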

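There is no good reason to convert a byte[] directly to a String unless you know that it is a valid encoding of the String you want to recover (and you know the charset that was used to encode it), or you suspect that it is a String and you want to try to recover its contents.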

If you want to transmit a byte[] over some channel that requires you to send a String, use base64 encoding.
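A minimal sketch using java.util.Base64 (available since Java 8). The wrapper class Base64Transport and its method names are only for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Transport {
    // Turns arbitrary bytes into an ASCII-only String that is safe to send as text.
    public static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the original bytes from the Base64 text.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] inputBytes = {37, 80, 68, 70, 45, 49, 46, 52, 13, 10, 37, -11, -28, -10, -4, 13, 10};
        String text = encode(inputBytes);
        byte[] outputBytes = decode(text);
        System.out.println(text);
        System.out.println(Arrays.equals(inputBytes, outputBytes));
    }
}
```

Unlike decoding with a charset, this is lossless for any input, at the cost of roughly a third more characters than bytes.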

Answer 2

Bonus question: Is it a true presumption that the conversions byte[] → String → byte[] will always have the same input and output byte arrays, if ISO_8859_1 is used?

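Yes. Any single-byte charset that maps a unique character to every byte value will preserve all byte values in a round-trip conversion. ISO 8859-1, as of 1987, does have a unique mapping for every byte value.

By contrast, CP1252 (Windows Latin 1), a common default charset on Windows, has 5 byte values with no character mapped to them. So if you round-trip arbitrary bytes through CP1252, on average 5 out of every 256 bytes, or about 2% of your data, will be corrupted.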
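One way to see which byte values a charset cannot round-trip is to try each of the 256 values on its own. This is a sketch under a few assumptions: the class and method names are invented, and it assumes the runtime provides a charset under the name "windows-1252" (Java only guarantees a small set of standard charsets). The exact result for CP1252 depends on the JDK's mapping table, so no output is claimed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LossyByteValues {
    // Returns every byte value (0..255) that does not survive
    // byte -> String -> byte with the given charset.
    public static List<Integer> lossyByteValues(Charset charset) {
        List<Integer> lossy = new ArrayList<>();
        for (int value = 0; value < 256; value++) {
            byte[] in = {(byte) value};
            byte[] out = new String(in, charset).getBytes(charset);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(value);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1:   " + lossyByteValues(StandardCharsets.ISO_8859_1));
        // Assumption: this charset name is available on the running JDK.
        System.out.println("windows-1252: " + lossyByteValues(Charset.forName("windows-1252")));
    }
}
```

The byte values left undefined in the Windows-1252 code page are 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so those are the candidates to expect in the second list.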


java

utf-8

data-conversion

character-encoding