Copy-Item using Invoke-Async in PowerShell

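This article shows how to use Invoke-Async in PowerShell: https://sqljana.wordpress.com/2018/03/16/powershell-sql-server-run-in-parallel-collect-sql-results-with-print-output-from-across-your-sql-farm-fast/

I would like to run the Copy-Item cmdlet in parallel in PowerShell, because the alternative is to use FileSystemObject via Excel and copy one file at a time out of a total of millions of files.

I have cobbled together the following: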

<#
.SYNOPSIS
<Brief description>
For examples type:
Get-Help .\<filename>.ps1 -examples
.DESCRIPTION
Copys files from one path to another
.PARAMETER FileList
e.g. C:\path\to\list\of\files\to\copy.txt
.PARAMETER NumCopyThreads
default is 8 (but can be 100 if you want to stress the machine to maximum!)
.EXAMPLE
.\CopyFilesToBackup -filelist C:\path\to\list\of\files\to\copy.txt
.NOTES
#>

[CmdletBinding()] 
Param( 
    [String] $FileList = "C:\temp\copytest.csv", 
    [int] $NumCopyThreads = 8
) 

$filesToCopy = New-Object "System.Collections.Generic.List[fileToCopy]"
$csv = Import-Csv $FileList

foreach($item in $csv)
{
    $file = New-Object fileToCopy
    $file.SrcFileName = $item.SrcFileName
    $file.DestFileName = $item.DestFileName
    $filesToCopy.add($file)
}

$sb = [scriptblock] {
    param($file)
    Copy-item -Path $file.SrcFileName -Destination $file.DestFileName
}
$results = Invoke-Async -Set $filesToCopy -SetParam file -ScriptBlock $sb -Verbose -Measure:$true -ThreadCount 8
$results | Format-Table

Class fileToCopy {
    [String]$SrcFileName = ""
    [String]$DestFileName = ""
}

The CSV input looks like this:

SrcFileName,DestFileName
C:\Temp\dummy-data14381438-0154723869.zip,\backupserver\Project Archives143854723869.zip
C:\Temp\dummy-data14381438-0165498273.xlsx,\backupserver\Project Archives143865498273.xlsx

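What am I missing to get this working? When I run .\CopyFiles.ps1 -FileList C:\Temp\test.csv, nothing happens. The files exist in the source path, but the file objects are not being picked up from the -Set collection. (Unless I have misunderstood how the collection is used?)

No, I cannot use robocopy for this, because there are millions of files that resolve to different paths depending on their original location.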

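I have no explanation for your symptom based on the code in your question (see the bottom section), but I suggest basing your solution on the (now) standard Start-ThreadJob cmdlet (it ships with PowerShell Core; in Windows PowerShell, install it with, e.g., Install-Module ThreadJob -Scope CurrentUser [1]):

Such a solution is more efficient than using the third-party Invoke-Async function, which as of this writing is flawed in that it waits for jobs to finish in a tight loop, which creates unnecessary processing overhead.

Start-ThreadJob jobs are a lightweight, thread-based alternative to the process-based Start-Job background jobs, yet they integrate with the standard job-management cmdlets, such as Wait-Job and Receive-Job.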
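As a minimal sketch of that integration (assuming the ThreadJob module is available, as described above; the doubling script block is purely illustrative), a thread job can be waited for, collected, and removed with the same cmdlets used for regular background jobs:

```powershell
# Start one thread job; the script block and its argument are illustrative only.
$job = Start-ThreadJob -ScriptBlock { param($n) $n * 2 } -ArgumentList 21

# The standard job cmdlets work on thread jobs too.
$result = $job | Wait-Job | Receive-Job
Remove-Job -Job $job

$result   # 42
```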

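Here is a self-contained example based on your code that demonstrates its use:

Note: Whether you use Start-ThreadJob or Invoke-Async, you cannot explicitly reference custom classes such as [fileToCopy] in the script block, because it runs in a separate thread (runspace; see the bottom section). The solution below therefore simply uses [pscustomobject] instances that carry only the properties of interest, for simplicity and brevity.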

# Create sample CSV file with 10 rows.
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\server\share\a,baz
2,c:\tmp\b,\server\share\b,baz
3,c:\tmp\c,\server\share\c,baz
4,c:\tmp\d,\server\share\d,baz
5,c:\tmp\e,\server\share\e,baz
6,c:\tmp\f,\server\share\f,baz
7,c:\tmp\g,\server\share\g,baz
8,c:\tmp\h,\server\share\h,baz
9,c:\tmp\i,\server\share\i,baz
10,c:\tmp\j,\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

# Import the CSV data and transform it to [pscustomobject] instances
# with only .SrcFileName and .DestFileName properties - they take
# the place of your original [fileToCopy] instances.
$jobs = Import-Csv $FileList | Select-Object SrcFileName, DestFileName | 
  ForEach-Object {
    # Start the thread job for the file pair at hand.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList $_ { 
      param($f) 
      $simulatedRuntimeMs = 2000 # How long each job (thread) should run for.
      # Delay output for a random period.
      $randomSleepPeriodMs = Get-Random -Minimum 100 -Maximum $simulatedRuntimeMs
      Start-Sleep -Milliseconds $randomSleepPeriodMs
      # Produce output.
      "Copied $($f.SrcFileName) to $($f.DestFileName)"
      # Wait for the remainder of the simulated runtime.
      Start-Sleep -Milliseconds ($simulatedRuntimeMs - $randomSleepPeriodMs)
    }
  }

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

The above yields output such as the following:

Creating jobs...
Waiting for 10 jobs to complete...
Copied c:\tmp\b to \server\share\b
Copied c:\tmp\g to \server\share\g
Copied c:\tmp\d to \server\share\d
Copied c:\tmp\f to \server\share\f
Copied c:\tmp\e to \server\share\e
Copied c:\tmp\h to \server\share\h
Copied c:\tmp\c to \server\share\c
Copied c:\tmp\a to \server\share\a
Copied c:\tmp\j to \server\share\j
Copied c:\tmp\i to \server\share\i
Total time lapsed: 00:00:05.1961541

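Note that the output received does not reflect the input order, and that the overall runtime is roughly twice the per-thread runtime of 2 seconds (plus overhead): because the input count is 10 but only 8 threads are available, the jobs have to run in 2 "batches".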

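If you raise the thread count to 10 or more (the documented default throttle limit is 5), the overall runtime drops to 2 seconds plus overhead, because all jobs then run simultaneously.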

Caveat: The numbers above stem from running in PowerShell Core on Microsoft Windows 10 Pro (64-bit; version 1903), using version 2.0.1 of the ThreadJob module.
Inexplicably, the same code is much slower in Windows PowerShell, v5.1.18362.145.

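However, for the sake of performance and memory consumption, it is better to use batching (chunking) in your case, i.e., to process multiple file pairs per thread.

The following solution demonstrates this approach; tweak $chunkSize to find a batch size that works for you.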

# Create sample CSV file with 10 rows.
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\server\share\a,baz
2,c:\tmp\b,\server\share\b,baz
3,c:\tmp\c,\server\share\c,baz
4,c:\tmp\d,\server\share\d,baz
5,c:\tmp\e,\server\share\e,baz
6,c:\tmp\f,\server\share\f,baz
7,c:\tmp\g,\server\share\g,baz
8,c:\tmp\h,\server\share\h,baz
9,c:\tmp\i,\server\share\i,baz
10,c:\tmp\j,\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

# How many files to process per thread
$chunkSize = 3

# The script block to run in each thread, which now receives a
# $chunkSize-sized *array* of file pairs.
$jobScriptBlock = { 
  param([pscustomobject[]] $filePairs)
  $simulatedRuntimeMs = 2000 # How long each job (thread) should run for.
  # Delay output for a random period.
  $randomSleepPeriodMs = Get-Random -Minimum 100 -Maximum $simulatedRuntimeMs
  Start-Sleep -Milliseconds $randomSleepPeriodMs
  # Produce output for each pair.  
  foreach ($filePair in $filePairs) {
    "Copied $($filePair.SrcFileName) to $($filePair.DestFileName)"
  }
  # Wait for the remainder of the simulated runtime.
  Start-Sleep -Milliseconds ($simulatedRuntimeMs - $randomSleepPeriodMs)
}

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

$jobs = & {

  # Process the input objects in chunks.
  $i = 0
  $chunk = [pscustomobject[]]::new($chunkSize)
  Import-Csv $FileList | Select-Object SrcFileName, DestFileName | ForEach-Object {
    $chunk[$i % $chunkSize] = $_
    if (++$i % $chunkSize -ne 0) { return }
    # Note the need to wrap $chunk in a single-element helper array (, $chunk)
    # to ensure that it is passed *as a whole* to the script block.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList (, $chunk) -ScriptBlock $jobScriptBlock
    $chunk = [pscustomobject[]]::new($chunkSize) # we must create a new array
  }

  # Process any remaining objects.
  # Note: $chunk -ne $null returns those elements in $chunk, if any, that are non-null
  if ($remainingChunk = $chunk -ne $null) { 
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList (, $remainingChunk) -ScriptBlock $jobScriptBlock
  }

}

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

While the output is effectively the same, note that only 4 jobs were created this time, each of which processed (up to) $chunkSize (3) file pairs.
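The examples above only simulate the copy work. As a rough sketch of what the chunked approach might look like with real Copy-Item calls, the following creates its own throwaway source files in a temporary folder instead of reading your CSV. The folder layout, the file names, and the per-file try/catch reporting are assumptions made for illustration, and the sketch presumes that the destination directory already exists:

```powershell
# Sketch only: throwaway temp folders stand in for the real source and backup locations.
$root    = Join-Path ([IO.Path]::GetTempPath()) "copydemo.$PID"
$srcDir  = Join-Path $root 'src'
$destDir = Join-Path $root 'dest'
$null = New-Item -ItemType Directory -Path $srcDir, $destDir -Force

# Create 10 small source files and the matching file pairs
# (these take the place of the rows imported from the CSV file).
$filePairs = foreach ($n in 1..10) {
  $src  = Join-Path $srcDir  "file$n.txt"
  $dest = Join-Path $destDir "file$n.txt"
  Set-Content -LiteralPath $src -Value "content $n"
  [pscustomobject] @{ SrcFileName = $src; DestFileName = $dest }
}

$NumCopyThreads = 8   # How many threads at most to run concurrently.
$chunkSize      = 3   # How many files to process per thread.

# Each thread copies its chunk of file pairs and reports per file.
$jobScriptBlock = {
  param([pscustomobject[]] $filePairs)
  foreach ($filePair in $filePairs) {
    try {
      Copy-Item -LiteralPath $filePair.SrcFileName -Destination $filePair.DestFileName -ErrorAction Stop
      "Copied $($filePair.SrcFileName) to $($filePair.DestFileName)"
    }
    catch {
      "FAILED $($filePair.SrcFileName): $_"
    }
  }
}

# Slice the list into chunks and start one thread job per chunk.
$jobs = for ($i = 0; $i -lt $filePairs.Count; $i += $chunkSize) {
  $lastIndex = [Math]::Min($i + $chunkSize, $filePairs.Count) - 1
  $chunk = $filePairs[$i..$lastIndex]
  # (, $chunk) passes the chunk to the script block *as a whole*.
  Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList (, $chunk) -ScriptBlock $jobScriptBlock
}

$results = Receive-Job -Job $jobs -Wait -AutoRemoveJob
$results

$copiedCount = (Get-ChildItem -LiteralPath $destDir -File).Count
"Files in destination: $copiedCount"

# Clean up the temp. folders.
Remove-Item -LiteralPath $root -Recurse -Force
```

Reporting failures as output strings keeps one bad file from aborting the rest of its chunk; with millions of files you would likely want to write such failures to a log instead.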


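As for what you tried:

The screenshot you show suggests that the problem is that your custom class, [fileToCopy], is not visible to the script block run by Invoke-Async.

Since Invoke-Async invokes the script block via the PowerShell SDK in separate runspaces, which know nothing about the caller's state, it is to be expected that these runspaces do not know your class (the same applies to Start-ThreadJob).

However, it is unclear why that is a problem in your code, because your script block makes no explicit reference to your class: your script-block parameter $file is not type-constrained (it is implicitly [object]-typed).

Therefore, simply accessing the properties of the custom-class instance inside the script block should work, and it indeed does in my tests with Windows PowerShell v5.1.18362.145 on Microsoft Windows 10 Pro (64-bit; version 1903).

However, if your real script-block code explicitly references the custom class [fileToCopy], e.g., by defining the parameter as param([fileToCopy] $file), you would see the symptom.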
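The distinction can be reproduced in isolation. The following sketch does not use Invoke-Async or Start-ThreadJob; it calls the PowerShell SDK directly through a hypothetical helper, Invoke-InNewRunspace, which is not part of any module, and the paths are dummy values. It is meant only to illustrate that a fresh runspace cannot resolve the class by name, while untyped property access on an instance passed in still works:

```powershell
class fileToCopy {
  [string] $SrcFileName = ''
  [string] $DestFileName = ''
}

$file = [fileToCopy]::new()
$file.SrcFileName  = 'C:\tmp\a'
$file.DestFileName = 'C:\backup\a'

# Hypothetical helper: runs script text in a brand-new runspace via the SDK.
function Invoke-InNewRunspace {
  param([string] $Script, $Argument)
  $ps = [powershell]::Create()
  try { $ps.AddScript($Script).AddArgument($Argument).Invoke() }
  finally { $ps.Dispose() }
}

# 1. The new runspace cannot resolve the class by name.
$typeKnownThere = Invoke-InNewRunspace '[bool] ("fileToCopy" -as [type])'

# 2. An untyped parameter works: only the instance's properties are accessed.
$described = Invoke-InNewRunspace 'param($file) "$($file.SrcFileName) -> $($file.DestFileName)"' $file

# 3. A type-constrained parameter fails, because [fileToCopy] is unknown there.
$typedFailed = $false
try {
  $null = Invoke-InNewRunspace 'param([fileToCopy] $file) $file.SrcFileName' $file
}
catch {
  $typedFailed = $true
  $typedError  = $_.Exception.Message
}

"Class known in new runspace: $typeKnownThere"
"Untyped parameter result:    $described"
"Typed parameter failed:      $typedFailed"
```

If the type-constrained variant is needed, the class definition would have to be made available inside each runspace; passing plain [pscustomobject] instances, as in the solutions above, avoids the issue altogether.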


[1] In Windows PowerShell v3 and v4, which do not ship with the PowerShellGet module, Install-Module is not available by default. However, the module can be installed on demand, as described in Installing PowerShellGet.