A length() result that makes no sense

Question

Since today I have been running into a very strange bug related to converting a byte[] to a String.

Here is the code:

private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};

public static void main(String[] args) {
    byte[] encryptedArray = xor("ciao".getBytes(), test_key);

    System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
    final String encrypted = new String(encryptedArray);

    System.out.println("Length: " + new String(encryptedArray).length());
    System.out.println(Arrays.toString(encrypted.getBytes()));

    System.out.println("Encrypted value: " + encrypted);
    System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}

private static byte[] xor(byte[] data, byte[] key) {
    byte[] result = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        result[i] = (byte) (data[i] ^ key[i % key.length]);
    }
    return result;
}

My output is:

Encrypted arrray: [-13, -82, -78, 18]
Length: 2
[-17, -65, -67, 18]
Encrypted value: �
Decrypted value: xno

Why does length() return 2? What am I missing?

Answer 1

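There is no one-to-one mapping between bytes and characters; the mapping depends on the charset you use. A String is logically a sequence of characters, so converting between chars and bytes requires a character encoding, which specifies how chars map to bytes and back. new String(encryptedArray) decodes the bytes with the platform's default charset, here UTF-8. The first three bytes of encryptedArray do not form a valid UTF-8 sequence, so they are replaced by the single replacement character U+FFFD, and the last byte (18) decodes to one control character. That gives a String of length 2, and encoding it back yields [-17, -65, -67, 18] rather than the original bytes.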
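To make the dependence on the charset concrete, here is a minimal sketch that decodes the four encrypted bytes from the question with two explicitly named charsets. The class and method names are made up for illustration; ISO-8859-1 is used only because it maps every byte to exactly one char.

```java
import java.nio.charset.StandardCharsets;

class CharsetLengthDemo {
    // Decodes the same bytes with UTF-8 and ISO-8859-1 and returns both lengths.
    static int[] lengths(byte[] bytes) {
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        return new int[] { asUtf8.length(), asLatin1.length() };
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // the encrypted array from the question
        int[] result = lengths(encrypted);
        System.out.println("UTF-8 length: " + result[0]);      // malformed bytes collapse
        System.out.println("ISO-8859-1 length: " + result[1]); // one char per byte
    }
}
```

The same bytes give a String of length 2 under UTF-8 and length 4 under ISO-8859-1, so the length of the String says nothing reliable about the number of bytes.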

If you want to carry the data in a String and still recover the exact bytes, Base64-encode encryptedArray and build the String from the result:

String encoded = new String(Base64.getEncoder().encode(encryptedArray));

To get the bytes back, just decode:

Base64.getDecoder().decode(encoded);
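Put together, the full round trip might look like the sketch below. The helper names encryptToText and decryptFromText are hypothetical, the key is just the first four bytes of test_key from the question, and UTF-8 is named explicitly for the plain text so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RoundTrip {
    // Same XOR routine as in the question.
    static byte[] xor(byte[] data, byte[] key) {
        byte[] result = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            result[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return result;
    }

    // XORs the text and returns the encrypted bytes as a Base64 string.
    static String encryptToText(String plain, byte[] key) {
        byte[] encrypted = xor(plain.getBytes(StandardCharsets.UTF_8), key);
        return Base64.getEncoder().encodeToString(encrypted);
    }

    // Reverses encryptToText: Base64-decode first, then XOR, then decode as UTF-8.
    static String decryptFromText(String encoded, byte[] key) {
        byte[] encrypted = Base64.getDecoder().decode(encoded);
        return new String(xor(encrypted, key), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = {-112, -57, -45, 125}; // first four bytes of test_key
        String encoded = encryptToText("ciao", key);
        System.out.println("Encoded: " + encoded);
        System.out.println("Decrypted: " + decryptFromText(encoded, key));
    }
}
```

The encrypted bytes never pass through new String(byte[]), so nothing is replaced or lost on the way.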

Answer 2

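The elements of a String are not bytes; they are chars. A char is not a byte.

There are many ways to convert a char to a sequence of bytes (that is, many charset encodings).

Not every sequence of chars can be converted to a sequence of bytes; there is not always a mapping for every char. It depends on the charset encoding you choose.

Not every sequence of bytes can be converted to a String; the bytes must be syntactically valid for the specified charset.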
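Both failure directions can be seen in a small sketch like the following. The class and method names are invented for this example; it relies on the documented behaviour that String.getBytes(Charset) and new String(byte[], Charset) substitute a replacement instead of throwing.

```java
import java.nio.charset.StandardCharsets;

class LossyConversions {
    // Encoding: a char with no mapping in US-ASCII is replaced by '?' (byte 63).
    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Decoding: bytes that are not valid UTF-8 are replaced by U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = encodeAscii("\u00e9"); // 'e' with acute accent, not in US-ASCII
        System.out.println("Encoded byte: " + ascii[0]);

        String decoded = decodeUtf8(new byte[] {(byte) 0xFF}); // 0xFF never occurs in UTF-8
        System.out.println("Decoded char is U+FFFD: " + decoded.equals("\uFFFD"));
    }
}
```

In both cases the conversion succeeds silently, which is why the original data cannot be recovered afterwards.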

Answer 3

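I just thought of a nice way to show what is going on, which is why I'm answering this question: replace the new String(byte[]) constructor with another method. This method performs the same basic operation as the constructor, with one change: it throws an exception if it finds any invalid input.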

private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};

public static void main(String[] args) throws Exception {
    byte[] encryptedArray = xor("ciao".getBytes(), test_key);

    System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
    final String encrypted = new String(encryptedArray);

    // original
    System.out.println("Length: " + new String(encryptedArray).length());
    
    // replacement
    System.out.println("Length: " + decode(encryptedArray).length());
    
    
    System.out.println(Arrays.toString(encrypted.getBytes()));

    System.out.println("Encrypted value: " + encrypted);
    System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}

private static String decode(byte[] encryptedArray) throws CharacterCodingException {
    var decoder = Charset.defaultCharset().newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    var decoded = decoder.decode(ByteBuffer.wrap(encryptedArray));
    return decoded.toString();
}

private static byte[] xor(byte[] data, byte[] key) {
    byte[] result = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        result[i] = (byte) (data[i] ^ key[i % key.length]);
    }
    return result;
}

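The method is called decode because that is what you are actually doing: you are decoding bytes into text. A character encoding encodes characters as bytes, so going the other way is decoding.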

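As you can see from the output in the question, the length is 2 if your platform uses UTF-8 as its default encoding (Linux, Android, macOS). On Windows, where the default has traditionally been Windows-1252 (a single-byte encoding that is essentially an extension of Latin-1, which is itself an extension of ASCII), you can get the same result by replacing Charset.defaultCharset() with StandardCharsets.UTF_8. However, if you use the decode method, it produces the following exception: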

java.nio.charset.MalformedInputException: Input length = 3
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:815)
    at StackExchange/com.stackexchange.so.ShowBadEncoding.decode(ShowBadEncoding.java:36)
    at StackExchange/com.stackexchange.so.ShowBadEncoding.main(ShowBadEncoding.java:24)

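Now you might expect the input length here to be 4, the size of the byte array. But note that a UTF-8 character may be encoded over multiple bytes. The error is not reported for the whole input, but for the character the decoder was trying to read when it failed. Here the first byte, -13 (0xF3), announces a four-byte sequence, but the fourth byte, 18, is not a continuation byte, so the first three bytes are reported as malformed.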
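The role of each byte can be checked with a simplified classifier such as the one below. It is a sketch only: the names are hypothetical, and it looks at the leading bits of each byte without validating whole sequences the way a real decoder does.

```java
class Utf8ByteKinds {
    // Classifies a byte by its leading bits, following the UTF-8 bit patterns.
    static String kind(byte b) {
        int v = b & 0xFF; // treat the byte as unsigned
        if (v < 0x80) return "single-byte (ASCII)";     // 0xxxxxxx
        if (v < 0xC0) return "continuation";            // 10xxxxxx
        if (v < 0xE0) return "lead of 2-byte sequence"; // 110xxxxx
        if (v < 0xF0) return "lead of 3-byte sequence"; // 1110xxxx
        if (v < 0xF8) return "lead of 4-byte sequence"; // 11110xxx
        return "invalid in UTF-8";
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // the encrypted array from the question
        for (byte b : encrypted) {
            System.out.printf("%4d (0x%02X): %s%n", b, b & 0xFF, kind(b));
        }
    }
}
```

For the question's array this reports a four-byte lead, two continuation bytes, and then an ASCII byte where a third continuation byte was required.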

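If you replace REPORT with REPLACE, the action the String constructor applies (duh), you will see that the result is the same as with the constructor, and length() now returns 2.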
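A variant of the decode method with the REPLACE action could look roughly like this. The class and method names are assumptions made for the example, UTF-8 is named explicitly instead of relying on the platform default, and the checked exception is wrapped only to keep the call site short.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplaceDecoder {
    // Decodes as UTF-8, substituting U+FFFD for malformed input instead of throwing.
    static String decodeReplacing(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // the encrypted array from the question
        String viaDecoder = decodeReplacing(encrypted);
        String viaConstructor = new String(encrypted, StandardCharsets.UTF_8);
        System.out.println("Length: " + viaDecoder.length());
        System.out.println("Same as constructor: " + viaDecoder.equals(viaConstructor));
    }
}
```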

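Of course, the answer that says you need to use Base64 encoding is right. That encodes the bytes as characters in a way that preserves all of their meaning; the reverse, of course, is decoding the text back to bytes.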

