File md5 hash changes when chunking it (for netty transfer)

Question

Question at the bottom

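I am using netty to transfer a file to another server. Because of the WebSocket protocol, I limit the file chunks to 1024*64 bytes (64 KB). The following method is a local example of what happens to the file: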

public static void rechunck(File file1, File file2) {

    FileInputStream is = null;
    FileOutputStream os = null;

    try {

        byte[] buf = new byte[1024*64];

        is = new FileInputStream(file1);
        os = new FileOutputStream(file2);

        while(is.read(buf) > 0) {
            os.write(buf);
        }

    } catch (IOException e) {
        Controller.handleException(Thread.currentThread(), e);
    } finally {

        try {

            if(is != null && os != null) {
                is.close();
                os.close();
            }

        } catch (IOException e) {
            Controller.handleException(Thread.currentThread(), e);
        }

    }

}

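The file is read by the InputStream into a byte buffer and written directly to the OutputStream. The content of the file must not change during this process.

To get the MD5 hashes of the files, I wrote the following method: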

public static String checksum(File file) {

    InputStream is = null;

    try {

        is = new FileInputStream(file);
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read = 0;

        while((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }

        return new BigInteger(1, digest.digest()).toString(16);

    } catch(IOException | NoSuchAlgorithmException e) {
        Controller.handleException(Thread.currentThread(), e);
    } finally {

        try {
            is.close();
        } catch(IOException e) {
            Controller.handleException(Thread.currentThread(), e);
        }

    }

    return null;

}

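So, in theory it should return the same hash, shouldn't it? The problem is that it returns two different hashes on every run. The file size stays the same and so does the content. When I run the method once with in: file-1, out: file-2 and again with in: file-2, out: file-3, the hashes of file-2 and file-3 are the same! This means the method changes the file in the same way every time.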

1. 58a4a9fbe349a9e0af172f9cf3e6050a
2. 7b3f343fa1b8c4e1160add4c48322373
3. 7b3f343fa1b8c4e1160add4c48322373

Here is a small test that compares whether all buffers are equal. The test is positive, so it finds no difference.

File file1 = new File("controller/templates/Example.zip");
File file2 = new File("controller/templates2/Example.zip");

try {

    byte[] buf1 = new byte[1024*64];
    byte[] buf2 = new byte[1024*64];

    FileInputStream is1 = new FileInputStream(file1);
    FileInputStream is2 = new FileInputStream(file2);

    boolean run = true;
    while(run) {

        int read1 = is1.read(buf1), read2 = is2.read(buf2);
        String result1 = Arrays.toString(buf1), result2 = Arrays.toString(buf2);
        boolean test = result1.equals(result2);

        System.out.println("1: " + result1);
        System.out.println("2: " + result2);
        System.out.println("--- TEST RESULT: " + test + " ----------------------------------------------------");

        if(!(read1 > 0 && read2 > 0) || !test) run = false;

    }

} catch (IOException e) {
    e.printStackTrace();
}

Question: Can you help me chunk the file without changing the hash?

Answer 1

while(is.read(buf) > 0) {
    os.write(buf);
}

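The read() method with an array argument returns the number of bytes read from the stream. When the file length is not an exact multiple of the byte array length, the return value of the last read will be smaller than the array length, because you have reached the end of the file.

However, your os.write(buf); call writes the whole byte array to the stream, including the leftover bytes after the end of the file. This means the written file ends up larger, so the hash changes.

Interestingly, you did not make this mistake when updating the message digest: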
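The effect described above can be reproduced in memory. The following is a minimal sketch, not the asker's code: the class name PaddingDemo and both method names are made up for illustration, and in-memory streams stand in for the file streams. With 10 bytes of input and a 4-byte buffer, the last read returns only 2 bytes, yet the buggy variant still writes all 4.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class PaddingDemo {

    // Buggy variant: always writes the whole buffer, like os.write(buf).
    public static byte[] copyWholeBuffer(byte[] data, int bufSize) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufSize];
        while (in.read(buf, 0, buf.length) > 0) {
            // Equivalent to out.write(buf): ignores how many bytes were read.
            out.write(buf, 0, buf.length);
        }
        return out.toByteArray();
    }

    // Fixed variant: writes only the bytes that the last read delivered.
    public static byte[] copyReadBytesOnly(byte[] data, int bufSize) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufSize];
        int n;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(copyWholeBuffer(data, 4).length);   // 12: two stale bytes appended
        System.out.println(copyReadBytesOnly(data, 4).length); // 10: same length as the input
    }
}
```

The two extra bytes in the buggy output are leftovers from the previous chunk (here 7 and 8), which is why copying the already padded file again produces an identical file and an identical hash.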

while((read = is.read(buffer)) > 0) {
    digest.update(buffer, 0, read);
}

You just have to do the same thing when you "rechunk" your file.

Answer 2

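Your rechunk method has a bug. Because you use a fixed-size buffer there, the file is split into parts the size of the byte array. But the last part of the file can be smaller than the buffer, which is why you write too many bytes to the new file. That is why the checksum no longer matches. The bug can be fixed like this: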

public static void rechunck(File file1, File file2) {

    FileInputStream is = null;
    FileOutputStream os = null;

    try {

        byte[] buf = new byte[1024*64];

        is = new FileInputStream(file1);
        os = new FileOutputStream(file2);
        int length;
        while((length = is.read(buf)) > 0) {
            os.write(buf, 0, length);
        }

    } catch (IOException e) {
        Controller.handleException(Thread.currentThread(), e);
    } finally {

        try {

            if(is != null)
                is.close();
            if(os != null)
                os.close();

        } catch (IOException e) {
            Controller.handleException(Thread.currentThread(), e);
        }

    }

}

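Thanks to the length variable, the write method knows that only the first length bytes of the byte array belong to the file. The bytes after that are old data from the previous read and are no longer part of the file.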
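As a possible variation on the fix above, the same loop can be written with try-with-resources (Java 7 and later), which closes both streams even when the copy or one of the close calls fails. This is only a sketch under assumptions: the class name RechunkSketch, the Path-based signatures, and the roundTripPreservesHash helper are hypothetical and not part of the original code, and errors are rethrown as unchecked exceptions instead of going through Controller.handleException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

public class RechunkSketch {

    // Copies src to dst in 64 KB chunks, writing only the bytes actually read.
    public static void rechunk(Path src, Path dst) {
        byte[] buf = new byte[1024 * 64];
        try (InputStream is = Files.newInputStream(src);
             OutputStream os = Files.newOutputStream(dst)) {
            int length;
            while ((length = is.read(buf)) > 0) {
                os.write(buf, 0, length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the MD5 of a file as 32 hex characters (leading zeros kept).
    public static String md5Hex(Path file) {
        try (InputStream is = Files.newInputStream(file)) {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = is.read(buffer)) > 0) {
                digest.update(buffer, 0, read);
            }
            return String.format("%032x", new BigInteger(1, digest.digest()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical self-check: writes `size` random bytes to a temp file,
    // rechunks it, and compares the size and MD5 of source and copy.
    public static boolean roundTripPreservesHash(int size) {
        try {
            Path src = Files.createTempFile("rechunk-src", ".bin");
            Path dst = Files.createTempFile("rechunk-dst", ".bin");
            try {
                byte[] data = new byte[size];
                new Random(42).nextBytes(data);
                Files.write(src, data);
                rechunk(src, dst);
                return Files.size(src) == Files.size(dst)
                        && md5Hex(src).equals(md5Hex(dst));
            } finally {
                Files.deleteIfExists(src);
                Files.deleteIfExists(dst);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 100,000 is deliberately not a multiple of 64 KB.
        System.out.println(roundTripPreservesHash(100_000));
    }
}
```

A size that is not a multiple of 1024*64 is the interesting case, because only then is the last chunk shorter than the buffer.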

java

md5

netty

filehash