How to add missing rows (URLs containing page numbers) to an array (like seq in Linux)

Question

I have an array of URLs of the following form:

$URLs = @("https://somesite.com/folder1/page/1/"
,"https://somesite.com/folder222/page/1/"
,"https://somesite.com/folder222/page/2/"
,"https://somesite.com/folder444/page/1/"
,"https://somesite.com/folder444/page/3/"
,"https://somesite.com/folderBBB/page/1/"
,"https://somesite.com/folderBBB/page/5/")

Each folder always has /page/1/. I need to add (or rebuild) all the missing URLs between page 1 and the highest page, so that the array ends up like this:

$URLs = @("https://somesite.com/folder1/page/1/"
,"https://somesite.com/folder222/page/1/"
,"https://somesite.com/folder222/page/2/"
,"https://somesite.com/folder444/page/1/"
,"https://somesite.com/folder444/page/2/"
,"https://somesite.com/folder444/page/3/"
,"https://somesite.com/folderBBB/page/1/"
,"https://somesite.com/folderBBB/page/2/"
,"https://somesite.com/folderBBB/page/3/"
,"https://somesite.com/folderBBB/page/4/"
,"https://somesite.com/folderBBB/page/5/")

I imagine the pseudocode would be something like this:

For each folder, extract the highest page number:

hxxps://somesite.com/folderBBB/page/5/

Expand this from (5) down to (1):

  hxxps://somesite.com/folderBBB/page/1/
  hxxps://somesite.com/folderBBB/page/2/
  hxxps://somesite.com/folderBBB/page/3/
  hxxps://somesite.com/folderBBB/page/4/
  hxxps://somesite.com/folderBBB/page/5/

Output the result to an array.

Any pointers are welcome!

Answer 1

You can use a pipeline-based solution with the Group-Object cmdlet, as follows:

$URLs = @("https://somesite.com/folder1/page/1/"
  , "https://somesite.com/folder222/page/1/"
  , "https://somesite.com/folder222/page/2/"
  , "https://somesite.com/folder444/page/1/"
  , "https://somesite.com/folder444/page/3/"
  , "https://somesite.com/folderBBB/page/1/"
  , "https://somesite.com/folderBBB/page/5/")

$URLs |
  Group-Object { $_ -replace '[^/]+/$' } | # Group by shared prefix
    ForEach-Object {
      # Extract the start and end number for the group at hand.
      # '$1' refers to the first capture group, i.e. the trailing number.
      [int] $from, [int] $to = 
        ($_.Group[0], $_.Group[-1]) -replace '^.+/([^/]+)/$', '$1'
      # Generate the output URLs.
      # You can assign the entire pipeline to a variable 
      # ($generatedUrls = $URLs | ...) to capture them in an array.
      foreach ($i in $from..$to) { $_.Name + $i + '/' }
    }

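Notes:

This assumes that, within each group of URLs sharing the same prefix, the first and last elements always contain the start and end points of the desired enumeration, respectively.
- If that assumption doesn't hold, use the following instead: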
```
$minMax = $_.Group -replace '^.+/([^/]+)/$', '$1' |
            Measure-Object -Minimum -Maximum
$from, $to = $minMax.Minimum, $minMax.Maximum
```
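The regex-based -replace operator is used for two things:
- -replace '[^/]+/$' removes the last component from each URL, so that the URLs can be grouped by their shared prefix.
- -replace '^.+/([^/]+)/$', '$1' effectively extracts the last component from each given URL, i.e. the number that represents the start or end point of the desired enumeration.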
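
To make the two -replace operations concrete, here is a minimal sketch that applies each of them to a single URL taken from the question. The variable names ($sampleUrl, $prefix, $page) are illustrative only and are not part of the solution above.

```powershell
# Illustration only: apply both -replace operations to one sample URL.
$sampleUrl = 'https://somesite.com/folderBBB/page/5/'

# Remove the last path component (and its trailing slash),
# leaving the prefix that all pages of this folder share.
$prefix = $sampleUrl -replace '[^/]+/$'

# Replace the whole URL with capture group 1, i.e. the last path component.
# The single quotes keep PowerShell from expanding $1 as a variable.
$page = [int] ($sampleUrl -replace '^.+/([^/]+)/$', '$1')

$prefix   # https://somesite.com/folderBBB/page/
$page     # 5
```

Because the prefix keeps its trailing slash, a URL can be rebuilt simply as $prefix + $page + '/'.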

Procedural alternative:

# Build a map (ordered hashtable) that maps URL prefixes
# to the number suffixes that occur among the URLs sharing
# the same prefix.
$map = [ordered] @{}
foreach ($url in $URLs) {
  if ($url -match '^(.+)/([^/]+)/') {
    $prefix, [int] $num = $Matches[1], $Matches[2]
    $map[$prefix] = [array] $map[$prefix] + $num
  }
}

# Process the map to generate the URLs.
# Again, use something like
#    $generatedUrls = foreach ...
# to capture them in an array.
foreach ($prefix in $map.Keys) {
  $nums = $map[$prefix]
  $from, $to = $nums[0], $nums[-1]
  foreach ($num in $from..$to) {
    '{0}/{1}/' -f $prefix, $num  # synthesize URL and output it.
  }
}
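
If the input order cannot be relied on, the ideas above can be combined into one reusable function. The following is a sketch under stated assumptions, not a definitive implementation: the function name Expand-PageUrl and its -Url parameter are hypothetical, and it assumes (as the question states) that every folder starts at page 1, so only the highest page number per prefix has to be determined.

```powershell
# Hypothetical helper; the name and parameter are illustrative only.
function Expand-PageUrl {
  param(
    [Parameter(Mandatory)]
    [string[]] $Url
  )

  $Url |
    Group-Object { $_ -replace '[^/]+/$' } |   # group by shared prefix
    ForEach-Object {
      $prefix = $_.Name
      # Extract all page numbers of this group as integers.
      $pages = [int[]] ($_.Group -replace '^.+/([^/]+)/$', '$1')
      # Highest page number, regardless of the order of the input.
      $max = ($pages | Measure-Object -Maximum).Maximum
      # Assumption: numbering always starts at page 1.
      foreach ($i in 1..$max) { $prefix + $i + '/' }
    }
}

# Deliberately unsorted sample input.
$sampleUrls = @(
  'https://somesite.com/folder444/page/3/'
  'https://somesite.com/folder444/page/1/'
  'https://somesite.com/folder1/page/1/'
)

# @(...) ensures an array even if only one URL is generated.
$generatedUrls = @(Expand-PageUrl -Url $sampleUrls)
$generatedUrls
```

For the sample input this yields four URLs, including the previously missing https://somesite.com/folder444/page/2/.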

powershell

automation

web-scraping