How to delete duplicate files with similar names
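I'm new to PowerShell and haven't been able to find a clear answer to my problem. I have a bunch of Excel files in different folders that are duplicates, but they have different file names because they keep being updated.
For example:
015 Approved Warranty - Turkey - Case-2019 08-1437015 (Issue 3)
015 Approved Warranty - Turkey - Case-2019 08-1437015 (Final Issue)
015 Approved Warranty - Turkey - Case-2019 08-1437015
015 Approved Warranty - Turkey - Case-2019 08-1437015 Amended
I've tried different approaches. I think I now know the simplest way to filter the files, but I don't know the syntax. The anchor would be the case number that follows the date. I want to compare the case numbers against each other, keep only the newest file (by modified date), and delete the rest. Any guidance is appreciated.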
#take files from folder
$dupesource = 'C:\Users\W_Brooker\Documents\Destination19'
#filter files by case number (7 digit number after date)
$files = Get-ChildItem $dupesource -Filter "08-aaaaaaa"
#If case number is the same keep newest file delete rest
foreach ($file in $files){
$file | Delete-Item - sort -property Datemodified |select -Last 1
}
This should do the trick:
$files = Get-ChildItem 'C:\Users\W_Brooker\Documents\Destination19' -Recurse
# create datatable to store file Information in it
$dt = New-Object system.Data.DataTable
[void]$dt.Columns.Add('FileName',[string]::Empty.GetType() )
[void]$dt.Columns.Add('CaseNumber',[string]::Empty.GetType() )
[void]$dt.Columns.Add('FileTimeStamp',[DateTime]::MinValue.GetType() )
[void]$dt.Columns.Add('DeleteFlag',[byte]::MinValue.GetType() )
# Step 1: Make inventory
foreach( $file in $files ) {
    if( !$file.PSIsContainer -and $file.Extension -like '.xls*' -and $file.Name -match '^.*\-\d+ *[\(\.].*$' ) {
        $row = $dt.NewRow()
        $row.FileName = $file.FullName
        # '$1' keeps only the captured digits after the last hyphen (the case number)
        $row.CaseNumber = $file.Name -replace '^.*\-(\d+) *[\(\.].*$', '$1'
        $row.FileTimeStamp = $file.LastWriteTime
        $row.DeleteFlag = 0
        [void]$dt.Rows.Add( $row )
    }
}
# Step 2: Mark files to delete
# Rows are sorted by case number, newest first within each case
$rows = $dt.Select('', 'CaseNumber, FileTimeStamp DESC')
$caseNumber = ''
foreach( $row in $rows ) {
    if( $row.CaseNumber -ne $caseNumber ) {
        # First row of a new case number is the newest file: keep it
        $caseNumber = $row.CaseNumber
        Continue
    }
    $row.DeleteFlag = 1
    [void]$dt.AcceptChanges()
}
# Step 3: Delete files
$rows = $dt.Select('DeleteFlag = 1', 'FileTimeStamp DESC')
foreach( $row in $rows ) {
    $fileName = $row.FileName
    Remove-Item -Path $fileName -Force | Out-Null
}
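To see what the CaseNumber expression in step 1 yields, the pattern can be tried on plain strings before any files are touched. This is only a minimal sketch: the sample names are modeled on the question, and the `.xlsx` extension and English labels are assumptions. It uses `'$1'` as the replacement so that only the captured digits remain.

```powershell
# Hypothetical sample names modeled on the question; no files are touched.
$names = @(
    '015 Approved Warranty - Turkey - Case-2019 08-1437015 (Issue 3).xlsx',
    '015 Approved Warranty - Turkey - Case-2019 08-1437015.xlsx',
    '015 Approved Warranty - Turkey - Case-2019 08-1437015 Amended.xlsx'
)
$pattern = '^.*\-(\d+) *[\(\.].*$'

$results = foreach ($name in $names) {
    if ($name -match $pattern) {
        # '$1' (single-quoted) inserts the first capture group:
        # the digits that follow the last hyphen.
        [pscustomobject]@{ Name = $name; CaseNumber = $name -replace $pattern, '$1' }
    }
    else {
        # The pattern requires '(' or '.' right after the digits
        # (optionally preceded by spaces), so other suffixes do not match.
        [pscustomobject]@{ Name = $name; CaseNumber = $null }
    }
}
$results | Format-Table -AutoSize
```

With these sample names, the first two yield `1437015`, while the "Amended" name does not match the pattern at all, so a file named like that would be skipped by the inventory step rather than compared with its siblings.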
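Here is an alternative that makes use of the PowerShell Group-Object cmdlet.
It uses a regular expression to match files that have a case number and ignores those that don't. See the screenshot at the bottom showing the test data (a collection of test xlsx files).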
cls
#Assume that each file has an xlsx extension.
#Assume that a case number always looks like this: "Case-YYYY~XX-Z" where YYYY is 4 digits, ~ is a single space, XX is two digits, and Z is one-to-many-digits
#make a list of xlsx files (recursive)
$files = Get-ChildItem -LiteralPath .\ExcelFiles -Recurse -Include *.xlsx
#$file is a System.IO.FileInfo object. Parse out the Case number and add it to the $file object as CaseNumber property
foreach ($file in $files)
{
    $Matches = $null
    $file.Name -match "(^.*)(Case-\d{4}\s{1}\d{2}-\d{1,})(.*\.xlsx$)" | out-null
    if ($Matches.Count -eq 4)
    {
        $caseNumber = $Matches[2]
        $file | Add-Member -NotePropertyName CaseNumber -NotePropertyValue $caseNumber
    }
    Else
    {
        #child folders will end up in this group too
        $file | Add-Member -NotePropertyName CaseNumber -NotePropertyValue "NoCaseNumber"
    }
}
#group the files by CaseNumber
$files | Group-Object -Property CaseNumber -OutVariable fileGroups | out-null
foreach ($fileGroup in $fileGroups)
{
    #skip folders and files that don't have a valid case #
    if ($fileGroup.Name -eq "NoCaseNumber")
    {
        continue
    }
    #for each group: sort files descending by LastWriteTime. Newest file will be first, so skip 1st file and remove the rest
    $fileGroup.Group | sort -Descending -Property LastWriteTime | select -skip 1 | foreach {Remove-Item -LiteralPath $_.FullName -Force}
}
Test data: (screenshot of the test xlsx files)
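How the regular expression in this answer splits a file name can be checked on a single string. The following is a small sketch under the assumption that the file name looks like the examples in the question (the name and its `.xlsx` extension are hypothetical):

```powershell
# Hypothetical file name modeled on the question (the .xlsx extension is assumed).
$name = '015 Approved Warranty - Turkey - Case-2019 08-1437015 (Issue 3).xlsx'

if ($name -match '(^.*)(Case-\d{4}\s{1}\d{2}-\d{1,})(.*\.xlsx$)') {
    # $Matches[0] holds the whole match; [1]..[3] hold the three capture groups,
    # which is why the code above checks for a Count of 4.
    $caseKey = $Matches[2]
    "Groups: $($Matches.Count); key: '$caseKey'"
}
```

For this name the grouping key is `Case-2019 08-1437015`, so every variant of the file that carries the same "Case-..." text ends up in the same group, whatever precedes or follows it.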
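The PowerShell-idiomatic solution is to combine multiple cmdlets in a single pipeline, in which Group-Object provides the core functionality of grouping duplicate files by the case number shared in their file names: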
# Define the regex that matches a case number:
# A 7-digit number embedded in filenames that duplicates share.
$regex = '\b\d{7}\b'
# Enumerate all files and select only those whose name contains a case number.
Get-ChildItem -File $dupesource | Where-Object { $_.BaseName -match $regex } |
  # Group the resulting files by shared embedded case number.
  Group-Object -Property { [regex]::Match($_.BaseName, $regex).Value } |
    # Process each group:
    ForEach-Object {
      # In each group, sort files by most recently updated first.
      $_.Group | Sort-Object -Descending LastWriteTimeUtc |
        # Skip the most recent file and delete the older ones.
        Select-Object -Skip 1 | Remove-Item -WhatIf
    }
The -WhatIf common parameter previews the operation. Remove it once you're sure the command will do what you want.
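The keep-newest logic of this pipeline can also be tried without a file system by feeding it stand-in objects. The sketch below is an illustration only: the objects, their names, and their timestamps are hypothetical, and the would-be deletions are collected in a variable (`$toDelete`, a name introduced here) instead of being passed to Remove-Item.

```powershell
# Stand-in objects instead of real files; names and timestamps are hypothetical.
$regex = '\b\d{7}\b'
$items = @(
    [pscustomobject]@{ BaseName = 'Case-2019 08-1437015 (Issue 3)'; LastWriteTimeUtc = [datetime]'2019-08-01' }
    [pscustomobject]@{ BaseName = 'Case-2019 08-1437015 Amended';   LastWriteTimeUtc = [datetime]'2019-08-20' }
    [pscustomobject]@{ BaseName = 'Case-2019 08-1437015';           LastWriteTimeUtc = [datetime]'2019-08-10' }
    [pscustomobject]@{ BaseName = 'Case-2019 09-2200417';           LastWriteTimeUtc = [datetime]'2019-09-05' }
    [pscustomobject]@{ BaseName = 'Notes without a case number';    LastWriteTimeUtc = [datetime]'2019-09-06' }
)

# Same pipeline shape as above, but collecting the items that would be
# deleted instead of removing anything.
$toDelete = $items | Where-Object { $_.BaseName -match $regex } |
    Group-Object -Property { [regex]::Match($_.BaseName, $regex).Value } |
    ForEach-Object {
        # Newest first; everything after the first item is a deletion candidate.
        $_.Group | Sort-Object -Descending LastWriteTimeUtc | Select-Object -Skip 1
    }

$toDelete | Select-Object BaseName, LastWriteTimeUtc
```

Here the "Amended" item is the newest in its group and is kept, the single-item group loses nothing, and the item without a 7-digit number never enters a group. If the duplicates are spread across subfolders, as the question suggests, Get-ChildItem would presumably also need -Recurse; that depends on the actual folder layout.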