Powershell Speed: How to speed up ForEach-Object MD5/hash check

I'm running the following MD5 check on 500 million files to look for duplicates. The script takes forever to run, and I'd like to know how to speed it up. Could I use a try/catch block instead of Contains, so that an error is thrown when the hash already exists? What would you recommend?

$folder = Read-Host -Prompt 'Enter a folder path'

$hash = @{}
$lineCheck = 0

Get-ChildItem $folder -Recurse | where {! $_.PSIsContainer} | ForEach-Object {
    $lineCheck++
    Write-Host $lineCheck
    $tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash;

    if(! $hash.Contains($tempMD5)){
        $hash.Add($tempMD5,$_.FullName)
    }
    else{
        Remove-Item -literalPath $_.fullname;
    }
} 

I'd guess the slowest part of your code is the Get-FileHash call, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).
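
A quick way to test that assumption is Measure-Command; time the hash call in isolation against one of your files (the path below is just a placeholder):

# Hypothetical sample file; substitute one of your own large files.
$sample = 'C:\some\large\file.bin'
Measure-Command { Get-FileHash -LiteralPath $sample -Algorithm MD5 } |
    Select-Object TotalMilliseconds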

You could try replacing it with a call to a native tool that has a more optimized MD5 implementation and see if that helps.
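
For example, certutil ships with Windows and can hash files. A minimal sketch of wrapping it (the helper name is mine, and the parsing assumes certutil's usual three-line output: a header, the hash, and a footer):

# The hash is on the second output line, possibly with spaces between byte pairs
# on older Windows versions; strip whitespace and uppercase to match Get-FileHash.
Function Get-NativeMD5([String]$Path) {
    $output = certutil -hashfile $Path MD5
    ($output[1] -replace '\s', '').ToUpperInvariant()
}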

Could I use a try catch loop instead of contains to throw an error when the hash already exists instead?

Exceptions are slow, and using them for control flow is not recommended:

While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handlers as part of the regular program execution logic can be expensive and should be avoided.

There is a definitive answer to this from the man who implemented them, Chris Brumme. He wrote an excellent blog article about the subject (warning: it's very long; second warning: it's very well written, and if you're a techie you'll read it to the end and then have to make up your hours after work :) )

The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even cross the ring 0 CPU boundary!
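
For illustration, this is roughly what the try/catch variant from the question would look like (a sketch of the anti-pattern, not a recommendation). Hashtable.Add throws an ArgumentException on a duplicate key, so every duplicate file would pay the cost of a thrown and caught exception:

$file = $_.FullName    # capture first: inside catch, $_ is the error record, not the file
try {
    $hash.Add($tempMD5, $file)       # throws ArgumentException if the key already exists
} catch {
    Remove-Item -LiteralPath $file   # the duplicate branch, reached via an exception
}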

As suggested in the comments, you could consider only starting to hash a file once its length matches the length of a file found earlier. That way you never invoke the expensive hash method for any file whose length is unique.

*Note: the Write-Host command is quite expensive by itself, so I would not display every iteration (Write-Host $lineCheck) but only, for example, when a match is found.

$Folder = Read-Host -Prompt 'Enter a folder path'

$FilesBySize = @{}    # file length -> list of paths with that length
$FilesByHash = @{}    # MD5 hash -> list of paths with that hash

# Hashes the file, records it under its hash, and returns whether the hash was already known.
Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}

Get-ChildItem $Folder -Recurse | Where-Object -Not PSIsContainer | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # The first file of this length was never hashed; catch up on it now.
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        # First file with this length; record it, but don't hash it yet.
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}

To display the duplicates that were found:

ForEach($Hash in $FilesByHash.GetEnumerator()) {
    If ($Hash.Value.Count -gt 1) {
        Write-Host 'Hash:' $Hash.Name
        ForEach ($File in $Hash.Value) {
            Write-Host 'File:' $File
        }
    }
}

I know this is a PowerShell question, but you can take full advantage of parallelization in C#. You also mentioned in one of the comments using C# as an alternative, so I thought it wouldn't hurt to post a possible implementation of how it could be done.

You could start by creating a method that calculates the MD5 checksum of a file:

private static string CalculateMD5(string filename)
{
    using var md5 = MD5.Create();
    using var stream = File.OpenRead(filename);
    var hash = md5.ComputeHash(stream);
    return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}

Then you could create a method that queries all file hashes in parallel, using ParallelEnumerable.AsParallel():

private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
{
    var allFiles = Directory
        .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

    var hashedFiles = allFiles
        .AsParallel()
        .Select(filename => new FileHash { 
            FileName = filename, 
            Hash = CalculateMD5(filename) 
        });

    return hashedFiles;
}

Then you can simply use the methods above to delete duplicate files:

private static void DeleteDuplicateFiles(string directoryPath)
{
    var fileHashes = new HashSet<string>();

    foreach (var fileHash in FindFileHashes(directoryPath))
    {
        if (!fileHashes.Contains(fileHash.Hash))
        {
            Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
            fileHashes.Add(fileHash.Hash);
            continue;
        }

        Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
        File.Delete(fileHash.FileName);
    }
}

The complete program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Security.Cryptography;

namespace Test
{
    internal class FileHash
    {
        public string FileName { get; set; }
        public string Hash { get; set; }
    }

    public class Program
    {
        public static void Main()
        { 
            var path = @"C:\Path\To\Files";
            if (Directory.Exists(path))
            {
                Console.WriteLine($"Deleting duplicate files at {path}");
                DeleteDuplicateFiles(path);
            }
        }

        private static void DeleteDuplicateFiles(string directoryPath)
        {
            var fileHashes = new HashSet<string>();

            foreach (var fileHash in FindFileHashes(directoryPath))
            {
                if (!fileHashes.Contains(fileHash.Hash))
                {
                    Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                    fileHashes.Add(fileHash.Hash);
                    continue;
                }

                Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                File.Delete(fileHash.FileName);
            }
        }

        private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
        {
            var allFiles = Directory
                .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

            var hashedFiles = allFiles
                .AsParallel()
                .Select(filename => new FileHash { 
                    FileName = filename, 
                    Hash = CalculateMD5(filename) 
                });

            return hashedFiles;
        }

        private static string CalculateMD5(string filename)
        {
            using var md5 = MD5.Create();
            using var stream = File.OpenRead(filename);
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }
}

If you're just looking for duplicates, the fastest approach is to use a dedicated tool like jdupes or fdupes. They are incredibly performant and written in C.
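
For example, assuming jdupes is installed and on your PATH (the folder path is a placeholder):

# List duplicate sets under a folder, recursively:
jdupes -r C:\Path\To\Files

# Delete duplicates without prompting, keeping the first file in each set:
jdupes -r -d -N C:\Path\To\Files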