PowerShell Detect Duplicate Files
I have a PowerShell script (see below) that finds all duplicate files that exist outside of an input path, and if any duplicates are found, I send an email with an attachment containing that information.
It works on my personal machine, and I'm currently testing it on our servers. I never expected it to be fast, but I'm now an hour into the test run and it still hasn't finished!
So my question is: is there anything I can do to reduce the time it takes to run?
As a bonus question: I encrypted the password file via PowerShell... could someone with access to that file decrypt it and view the password in plain text?
Any help would be greatly appreciated!
$sourcepath = "\\server1\privatetest\"
$duplicatepath = "\\server1\public\"
$dup_found = 0
function Send-ToEmail([string]$email, [string]$attachmentpath){
    $message = new-object Net.Mail.MailMessage;
    $message.From = "MyEmail@MyDomain.com";
    $message.To.Add($email);
    $message.Subject = "Duplicate Found";
    $message.Body = "Please see attachment";
    $attachment = New-Object Net.Mail.Attachment($attachmentpath);
    $message.Attachments.Add($attachment);
    $smtp = new-object Net.Mail.SmtpClient("smtp.gmail.com", "587");
    $smtp.EnableSSL = $true;
    $smtp.Credentials = New-Object System.Net.NetworkCredential($Username, $Password);
    $smtp.send($message);
    $attachment.Dispose();
}
If ((Test-Path $sourcepath) -AND (Test-Path $duplicatepath)) {
    $sourcefiles = Get-ChildItem $sourcepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
    $dupfiles = Get-ChildItem $duplicatepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
    $duplicates = [System.Collections.ArrayList]@()
    If (($sourcefiles.count -eq 0) -or ($dupfiles.count -eq 0)) {
        If ($sourcefiles.count -eq 0) {
            Write-Warning 'No files found in source path'
        }
        else {
            Write-Warning 'No files found in duplicate path'
        }
        Break
    }
    else {
        foreach ($sf in $sourcefiles) {
            $result1path = $sf | Select -Property Path
            $result1hash = $sf | Select -Property Hash
            foreach ($df in $dupfiles) {
                $result2path = $df | Select -Property Path
                $result2hash = $df | Select -Property Hash
                If (($result1hash) -like ($result2hash)) {
                    $dup_found = 1
                    $dupmsg = 'Source Path: '
                    $dupmsg = $dupmsg + $result1path
                    $dupmsg = $dupmsg + ', Source Hash: '
                    $dupmsg = $dupmsg + $result1hash
                    $dupmsg = $dupmsg + ', Duplicate Path: '
                    $dupmsg = $dupmsg + $result2path
                    $dupmsg = $dupmsg + ', Duplicate Hash: '
                    $dupmsg = $dupmsg + $result2hash
                    $duplicates = $duplicates + $dupmsg
                }
            }
        }
        if ($dup_found -eq 1) {
            $Username = "MyEmail@MyDomain.com";
            $pwfile = Get-Content "PasswordFile"
            $Password = $pwfile | ConvertTo-SecureString
            $path = "C:\temp\duplicates.txt";
            $duplicates | Out-File -FilePath C:\temp\duplicates.txt
            Send-ToEmail -email "MyEmail@MyDomain.com" -attachmentpath $path;
            Remove-Item C:\temp\duplicates.txt
        }
    }
}
else {
    If(!(Test-Path $sourcepath)) {
        Write-Warning 'Source path not found'
    }
    elseif(!(Test-Path $duplicatepath)) {
        Write-Warning 'Duplicate path not found'
    }
}
[...], is there anything I can do to reduce the time it takes to run?
Yes, there certainly is!
Reduce the runtime complexity
You have a classic performance problem here - by comparing every file in one collection against every other file in the other collection, you've created a quadratic algorithm.
What does quadratic mean? It means that for N input items in each collection, you now have to perform N^2 comparisons - so if each directory contains 1 file, you only need a single comparison - but with 2 files you need 4 comparisons, 3 files = 9 comparisons, and so on - with just 100 files in each directory you're already at 10,000(!) comparisons.
Instead, you'll want to use a data structure that can quickly determine whether it contains a particular value or not. For that, you can use a hashtable:
# Create a hashtable
$sourceFileIndex = @{}
# Use source files to populate the hashtable - we'll use the calculated hash as the key
Get-ChildItem $sourcepath -File -Recurse -ErrorAction SilentlyContinue |ForEach-Object {
    $hashed = $_ |Get-FileHash
    $sourceFileIndex[$hashed.Hash] = $hashed
}
# Keep the potential duplicates in an array, no need to change anything here
$dupfiles = Get-ChildItem $duplicatepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
#...
# Now we can remove the outer loop completely
foreach ($df in $dupfiles) {
    # Here's the magic - replace the string comparison with a call to ContainsKey()
    if ($sourceFileIndex.ContainsKey($df.Hash)) {
        $dup_found = 1
        $dupmsg = 'Source Path: '
        $dupmsg = $dupmsg + $sourceFileIndex[$df.Hash].Path
        $dupmsg = $dupmsg + ', Source Hash: '
        $dupmsg = $dupmsg + $sourceFileIndex[$df.Hash].Hash
        $dupmsg = $dupmsg + ', Duplicate Path: '
        $dupmsg = $dupmsg + $df.Path
        $dupmsg = $dupmsg + ', Duplicate Hash: '
        $dupmsg = $dupmsg + $df.Hash
        $duplicates = $duplicates + $dupmsg
    }
}
This alone should already give you a massive performance improvement.
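If you want to convince yourself of the difference before touching the real script, a rough synthetic benchmark along these lines should do - the hash strings here are fabricated purely for illustration, and the exact timings will vary with your machine:
# Fabricated "hashes" - just 1,000 unique strings
$hashes = 1..1000 | ForEach-Object { "HASH$_" }

# Quadratic approach: 1,000 x 1,000 = 1,000,000 comparisons
(Measure-Command {
    foreach ($a in $hashes) {
        foreach ($b in $hashes) {
            if ($a -like $b) { }
        }
    }
}).TotalSeconds

# Hashtable approach: 1,000 inserts + 1,000 lookups
(Measure-Command {
    $index = @{}
    foreach ($a in $hashes) { $index[$a] = $true }
    foreach ($b in $hashes) {
        if ($index.ContainsKey($b)) { }
    }
}).TotalSeconds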
Keep string manipulation to a minimum
Another costly aspect of your current approach (although less significant than the problem above) is the constant string concatenation - the runtime needs to re-allocate memory for every one of the small intermediate strings, and that eventually takes a toll on execution time when you're processing large amounts of data.
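You can observe this effect in isolation with a minimal sketch - the loop count is arbitrary, and timings will vary:
# Growing one string piece by piece forces repeated re-allocation
(Measure-Command {
    $s = ''
    foreach ($i in 1..10000) {
        $s = $s + "line $i, "
    }
}).TotalMilliseconds

# Collecting the pieces and joining once avoids that
(Measure-Command {
    $parts = foreach ($i in 1..10000) { "line $i, " }
    $s = $parts -join ''
}).TotalMilliseconds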
In this script, a good way to reduce the string manipulation is to construct structured objects instead of maintaining a running "output string":
foreach ($df in $dupfiles) {
    # Here's the magic - replace the string comparison with a call to ContainsKey()
    if ($sourceFileIndex.ContainsKey($df.Hash)) {
        $dup_found = 1
        # Create output object
        $dupeRecord = [pscustomobject]@{
            SourcePath    = $sourceFileIndex[$df.Hash].Path
            SourceHash    = $df.Hash # these are identical, no need to fetch the "source hash"
            DuplicatePath = $df.Path
            DuplicateHash = $df.Hash
        }
        [void]$duplicates.Add($dupeRecord)
    }
}
Another win! Since these are objects (as opposed to raw strings), you now have much more choice/flexibility when it comes to the output format:
# Want an HTML table? Go ahead!
$duplicates |ConvertTo-Html -As Table |Out-File .\path\to\attachment.html
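The same goes for other output formats - for example (hypothetical output paths, same $duplicates collection as above):
# Prefer a spreadsheet-friendly report? Export-Csv consumes the same objects
$duplicates | Export-Csv -Path .\path\to\attachment.csv -NoTypeInformation

# Or keep it as plain text, but nicely aligned
$duplicates | Format-Table -AutoSize | Out-File .\path\to\attachment.txt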