PowerShell 7.x: How to Select a Text Substring of Unknown Length Using Only Boundary Substrings

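I am trying to store a string taken from a text file: a substring of the original file, defined only by the text it starts with and the text it ends at. I am new to PowerShell, so my approach is simple/crude. Basically, my approach is:

  1. Roughly get what I want, starting from the beginning of the string
  2. Worry about cutting off what I don't want later

My minimal reproducible example is as follows: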

# selectStringTest.ps    
         
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"

#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"

# a rough estimate of the text file lines required
[int]$lines = 200
   
if (Select-String  -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines   
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')

# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)

$newFileStart | Out-File tempOutputFile

As is: the output starts correctly, but I cannot remove $boundaryName and the text after it.

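The original text file was generated by OCR (Optical Character Recognition), so the formatting is uneven. There are line breaks in odd places, so my options for delimiting are limited.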

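I am not sure whether my if (Select-String -InputObject $inputFile -pattern $refName) is valid, although it appears to work. The overall design seems crude, because I am guessing how many lines I will need. Finally, I have tried several ways of trimming the string at $boundaryName, all without success.

Any suggestions would be appreciated.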

The abbreviated content of the single Copy of 31832_226140__0001-00006.txt file (x2 200 listings) is:

Start of text file

________________

BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON

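To use a regular expression across line breaks, the file needs to be read as a single string. Get-Content -Raw does that. This assumes you do not want the lines containing refName and boundaryName included in the output.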

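# Read the whole file as one string so the regex can span line breaks
$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"

# (?smi): '.' matches newlines, multiline anchors, case-insensitive.
# \r?\n accepts both CRLF and LF line endings.
if ($c -match "(?smi).*$refName.*?\r?\n(.*)$boundaryName.*?\r?\n.*") {
    $result = $Matches[1]
}
$result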
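
The pattern above interpolates $refName and $boundaryName directly into the regex, so a marker containing regex metacharacters (such as `.` or `(`) would not be matched literally, and the greedy capture runs to the last occurrence of the boundary rather than the first. Below is a minimal sketch of one way to handle both points. The helper name Get-TextBetween and the inline sample string are assumptions made for illustration; they are not part of the original answer or the asker's file.

```powershell
# Hypothetical helper (illustration only): returns the text between the line
# containing $Start and the first later occurrence of $End, excluding both.
function Get-TextBetween {
    param (
        [Parameter(Mandatory)] [string]$Text,
        [Parameter(Mandatory)] [string]$Start,
        [Parameter(Mandatory)] [string]$End
    )
    # Escape the markers so characters like '.' are matched literally
    $s = [regex]::Escape($Start)
    $e = [regex]::Escape($End)

    # (?s) lets '.' match newlines; (?i) ignores case.
    # [^\n]*\n skips the rest of the start line; (.*?) captures lazily,
    # stopping at the first end marker.
    $pattern = '(?si)' + $s + '[^\n]*\n(.*?)' + $e

    $m = [regex]::Match($Text, $pattern)
    if ($m.Success) { $m.Groups[1].Value } else { $null }
}

# Inline sample standing in for the output of Get-Content -Raw (assumed data)
$sample = "header`n001 BARTLETT. Lois`nline A`nline B`n001 BEECH, Margaret`nfooter`n"
$between = Get-TextBetween -Text $sample -Start '001 BARTLETT' -End '001 BEECH'
$between
```

With a real file, `$sample` would be replaced by the string returned from Get-Content -Raw. Whether the lazy or the greedy capture is preferable depends on whether the boundary text can legitimately appear more than once.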


How close is this to what you want?

function Process-File {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [string]$HeadText,
        [Parameter(Mandatory = $true, Position = 1)]
        [string]$TailText,
        [Parameter(ValueFromPipeline)]
        $File
    )
    Process {
        $Inside = $false;
        switch -Regex -File $File.FullName {
            #'^\s*$' { continue }
            "(?i)^\s*$TailText(?<Tail>.*)`$"    { $Matches.Tail; $Inside = $false }
            '^(?<Line>.+)$'                     { if($Inside) { $Matches.Line } }
            "(?i)^\s*$HeadText(?<Head>.*)`$"    { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'

# Join-Path inserts the path separator that "$Path$File" would omit
$Result = Get-ChildItem -Path (Join-Path $Path $File) | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"

Here is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........
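
If the goal is the span that starts at $refName and stops just before $boundaryName, as the question describes, a non-regex alternative is to locate both markers with String.IndexOf and cut with Substring. This is only a sketch under stated assumptions: the inline sample text stands in for the real file read with Get-Content -Raw, and the comparison is case-sensitive (Ordinal), mirroring the -CaseSensitive switch in the question.

```powershell
# Inline sample standing in for the raw file content (assumed data)
$text = "PAGE 6`n001 BARTLETT. Lois Elizabeth`n...15 St Ronans Av`n001 BEECH, Margaret`nSO NON`n"
$refName = '001 BARTLETT'
$boundaryName = '001 BEECH'

# Find the start marker, then look for the boundary only after it
$start = $text.IndexOf($refName, [System.StringComparison]::Ordinal)
$end = if ($start -ge 0) {
    $text.IndexOf($boundaryName, $start + $refName.Length, [System.StringComparison]::Ordinal)
} else {
    -1
}

if ($start -ge 0 -and $end -ge 0) {
    # Keep everything from the start marker up to, but not including, the boundary
    $span = $text.Substring($start, $end - $start)
} else {
    $span = $null
    Write-Warning 'Start or boundary marker not found.'
}
$span
```

This avoids guessing a line count and does not depend on where the OCR placed line breaks, since it works on character positions rather than lines.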