PowerShell 7.0 how to compute hashsum of a big file read in chunks

The script should copy files and compute their hash sums. My goal is to write a function that reads the file once instead of three times (read_for_copy + read_for_hash + read_for_another_copy) to minimize network load. So I am trying to read a chunk of the file, compute the MD5 hash sum, and write that chunk out to several locations. File sizes can range from 100 MB to 2 TB, or even larger. There is no need to verify file identity at this point; I just need to compute the hash sum of the original file.

I got stuck computing the hash sum:

    $ifile = "C:\Users\User\Desktop\inputfile"
    $ofile = "C:\Users\User\Desktop\outputfile_1"
    $ofile2 = "C:\Users\User\Desktop\outputfile_2"
    
    $md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
    $bufferSize = 10mb
    $stream = [System.IO.File]::OpenRead($ifile)
    $makenew = [System.IO.File]::OpenWrite($ofile)
    $makenew2 = [System.IO.File]::OpenWrite($ofile2)
    $buffer = new-object Byte[] $bufferSize
    
    while ( $stream.Position -lt $stream.Length ) {
    
        $bytesRead = $stream.Read($buffer, 0, $bufferSize)
        $makenew.Write($buffer, 0, $bytesRead)
        $makenew2.Write($buffer, 0, $bytesRead)
    
        # I am stuck here
        $hash = [System.BitConverter]::ToString($md5.ComputeHash($buffer)) -replace "-",""
    }
    
    $stream.Close()
    $makenew.Close()
    $makenew2.Close()

How do I collect the chunks so that I can compute the hash of the whole file?

And a bonus question: is it possible to compute the hash and write the data out in parallel, especially given that workflow { parallel { } } is no longer supported as of PS version 6?

Many thanks

If you want to handle the input buffering manually, you need to use the TransformBlock/TransformFinalBlock methods exposed by $md5:

while($bytesRead = $stream.Read($buffer, 0, $bufferSize))
{
    # Write to file copies
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)

    # Feed next chunk to MD5 CSP
    $null = $md5.TransformBlock($buffer, 0 , $bytesRead, $null, 0)
}

# Complete the hashing routine
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)

# Grab hash value from CSP
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
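
To sanity-check the chunked result, you can compare it against the built-in Get-FileHash cmdlet (a minimal sketch, assuming the $ifile path and the $hash value from the code above):

# Get-FileHash returns uppercase hex without dashes, same format as $hash above
$reference = (Get-FileHash -Path $ifile -Algorithm MD5).Hash
$reference -eq $hash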

> My goal is to write a function that reads the file once instead of three times (read_for_copy + read_for_hash + read_for_another_copy) to minimize network load

I'm not entirely sure what you mean by network load here. If the source file sits on a remote file share but the new copies go to the local file system, you can minimize network load by simply copying the source file once, then using that local copy as the source for the second copy and for the hash calculation:

$ifile = "\remoteMachine\c$\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
    
# Copy remote -> local
Copy-Item -Path $ifile -Destination $ofile
# Copy local -> local
Copy-Item -Path $ofile -Destination $ofile2

# Hash local file stream
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$stream = [System.IO.File]::OpenRead($ofile)
$hash = [BitConverter]::ToString($md5.ComputeHash($stream)).Replace('-','')

FWIW, passing the file stream object directly to $md5.ComputeHash($stream) will likely be faster than buffering the input manually.
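
If you want to check that claim on your own files, Measure-Command gives a rough comparison (a quick sketch, assuming the $ofile path from above; not a rigorous benchmark):

# Time the single-call approach (assumes $ofile from the code above)
$md5 = [System.Security.Cryptography.MD5]::Create()
$elapsed = Measure-Command {
    $stream = [System.IO.File]::OpenRead($ofile)
    $null = $md5.ComputeHash($stream)
    $stream.Close()
}
"ComputeHash(stream): $($elapsed.TotalSeconds) s"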

Final listing:

$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"

$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 1mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize

while ( $stream.Position -lt $stream.Length ) 
{
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)

    # Feed each chunk to the hasher; the return value is only a byte count, so discard it
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}

# Finalize with an empty block, then read the accumulated hash
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
$hash

$stream.Close()
$makenew.Flush()
$makenew.Close()
$makenew2.Flush()
$makenew2.Close()
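
As for the bonus question: PowerShell 7 replaces workflow { parallel { } } with ForEach-Object -Parallel. The chunked hashing itself has to stay sequential, because TransformBlock must see the chunks in order, but with the copy-once approach above the second copy and the hash calculation can run side by side, since both only read the same local file. A minimal sketch, assuming the $ofile/$ofile2 paths from earlier:

# Run the second local copy and the hash calculation concurrently (PS 7+)
'copy', 'hash' | ForEach-Object -Parallel {
    if ($_ -eq 'copy') {
        Copy-Item -Path $using:ofile -Destination $using:ofile2
    }
    else {
        $md5 = [System.Security.Cryptography.MD5]::Create()
        $stream = [System.IO.File]::OpenRead($using:ofile)
        # Emit the hash string as the parallel block's output
        [BitConverter]::ToString($md5.ComputeHash($stream)).Replace('-','')
        $stream.Close()
    }
}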