Optimize: Calculating MD5 hash of large number of files recursively under a root folder

Question

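My current approach to generating the MD5 hashes of all files under a root directory is shown below.

As of now, it takes about 10 seconds (on an old Intel Core i3 CPU) to process roughly 300 images, each 5-10 MB on average. The parallel option on the stream does not help: with or without it, the time stays more or less the same. How can I make this faster?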

Files.walk(Path.of(rootDir), depth)
            .parallel() // doesn't help; time is approximately the same as without parallel
            .filter(path -> !Files.isDirectory(path)) // skip directories
            .map(FileHash::getHash)
            .collect(Collectors.toList());

The getHash method used above produces one comma-separated hash,<full file path> output line for each file processed in the stream.

public static String getHash(Path path) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(Files.readAllBytes(path)); // loads the whole file into memory
      // printHexBinary already returns upper-case hex digits
      String hash = DatatypeConverter.printHexBinary(md5.digest());
      return String.format("%s,%s", hash, path.toAbsolutePath());
    } catch (IOException | NoSuchAlgorithmException e) {
      // fail loudly instead of continuing with a null or empty digest
      throw new IllegalStateException("Could not hash " + path, e);
    }
  }

Answer 1

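The stream returned by Files.walk(Path.of(rootDir), depth) cannot be parallelized efficiently (it has no known size, so it is hard to determine the slices to parallelize). To improve performance in your case, you need to collect the result of Files.walk(...) first.

So you have to do: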

Files.walk(Path.of(rootDir), depth)
        .filter(path -> !Files.isDirectory(path)) // skip directories
        .collect(Collectors.toList())
        .stream()
        .parallel() // on my computer this divides the time needed by 5 (8-core CPU and SSD)
        .map(FileHash::getHash)
        .collect(Collectors.toList());
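
For reference, here is a minimal sketch of how the collect-first pipeline above might look as a complete class. The class name ParallelFileHasher, the helpers hashAll and toHex, and the 64 KiB buffer size are assumptions made for illustration; they are not from the question. This version of getHash also differs from the asker's: it reads each file in chunks rather than with Files.readAllBytes, and it builds the hex string by hand because DatatypeConverter is no longer bundled with the JDK as of Java 11. How much speedup you get will depend on your core count and disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name, used only for this sketch.
final class ParallelFileHasher {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Returns "HASH,absolute path" for one file, reading it in chunks. */
    static String getHash(Path path) {
        try (InputStream in = Files.newInputStream(path)) {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[64 * 1024]; // buffer size is an arbitrary choice
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
            return toHex(md5.digest()) + "," + path.toAbsolutePath();
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + path, e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Upper-case hex encoding, standing in for DatatypeConverter.printHexBinary. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** Collects the walk into a list first, then hashes the list in parallel. */
    static List<String> hashAll(Path rootDir, int depth) throws IOException {
        List<Path> files;
        // Files.walk holds open directory handles, so close it when done.
        try (Stream<Path> walk = Files.walk(rootDir, depth)) {
            files = walk.filter(path -> !Files.isDirectory(path)) // skip directories
                        .collect(Collectors.toList());
        }
        // A list has a known size, so the parallel stream can split it evenly.
        return files.parallelStream()
                    .map(ParallelFileHasher::getHash)
                    .collect(Collectors.toList());
    }
}
```

It could be called as ParallelFileHasher.hashAll(Path.of(rootDir), depth), which returns the same hash,<full file path> lines as the original pipeline.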

parallel-processing

optimization

checksum

stream

java-11