Output Substring to Newline from a Raw Text String using Regex

Question

I have a name delimiter that I want to use to extract the entire line on which it is found.

[string]$testString = $null

# broken text string of text & newlines which simulates $testString = Get-Content -Raw

$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"

# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"

# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined 
$testString -match "(?s)(?<BOURKE>.*)"

#test3
# I want just the line with BOURKE

$result = $matches['BOURKE']

$result

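Test 1 finds a match but only prints up to the first newline. Test 2 finds a match and includes all the newlines. I would like to know which regex pattern forces the output to begin at 001 BOURKE ...

Any suggestions would be appreciated.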

Answer 1

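I find it best to have the match consume everything up to what is not wanted, namely the \r\n line break. This can be done with a negated set, i.e. a set that starts with ^: [^\r\n]+ consumes characters up to, but not including, the next \r or \n. In other words, it matches everything that is not a \r or \n.

Use this: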

$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"

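You should also try to avoid * when you know there will be text to match: * is a greedy quantifier that accepts anything, including nothing at all. Using + (one or more) constrains the match, because the regex engine does not have to try the zero-length alternatives that * allows, for which backtracking is clearly not worthwhile.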
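A minimal sketch of the negated-set approach, assuming a made-up sample string with CRLF newlines (the variable name $sample and its contents are hypothetical, not from the question):

```powershell
# Hypothetical sample input with Windows-style CRLF newlines.
$sample = "line before`r`n001 BOURKE, Bridget Mary`r`nline after"

# [^\r\n]+ stops at the first CR or LF, so only the one line is captured.
if ($sample -match '(?<Bourke>\d\d\d\s[^\r\n]+)') {
    $bourkeLine = $Matches['Bourke']
    $bourkeLine    # 001 BOURKE, Bridget Mary
}
```

Note that this pattern keys on "three digits followed by whitespace" rather than on the name itself, so it returns the first such line it finds.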

Answer 2

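Note:

I'm assuming that you are looking for the entire line on which BOURKE appears as a substring.
In your own attempts, (?<BOURKE>...) merely gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.
For the use case at hand, there is no need for a (named) capture group at all, so the solutions below use none. When the -match operator is used, this means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.

If your multi-line input string contains only Unix-format LF newlines (\n), use the following: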

if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }

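Note:

To match case-sensitively, use -cmatch instead of -match.
If you know that the substring is preceded / followed by at least one character, use .+ instead of .*
If you want to search for the substring verbatim and it happens to or may contain regex metacharacters (e.g. .), apply [regex]::Escape() to it; e.g., [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b).

If there is a chance that Windows-format CRLF newlines (\r\n) are present, use: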
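The notes above can be sketched as follows; the input strings are invented for illustration and are not part of the question:

```powershell
# -match is case-insensitive by default; -cmatch is case-sensitive.
$insensitive = '001 BOURKE' -match  'bourke'    # True
$sensitive   = '001 BOURKE' -cmatch 'bourke'    # False

# Escape a literal search term so that '.' is not treated as "any character".
$text    = "see fileXtxt here`nsee file.txt here"    # hypothetical LF-only input
$literal = [regex]::Escape('file.txt')               # file\.txt
if ($text -match ('.*' + $literal + '.*')) {
    $literalLine = $Matches[0]                       # see file.txt here
}

# \b requires a word boundary, so BOURKE is not found inside BOURKES.
$bounded = 'BOURKES' -match '\bBOURKE\b'             # False
```

Without the escaping, the unescaped pattern file.txt would already match the first line (fileXtxt), since . matches any character there.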

if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }

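For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*

In short:

By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $); with CRLF newlines, however, \r must be excluded explicitly so that it is not captured as part of the match. [1]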
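To see the difference on CRLF input, here is a small sketch; the string in $crlf is a made-up example:

```powershell
# Hypothetical input with Windows-style CRLF newlines.
$crlf = "first`r`n001 BOURKE, Bridget`r`nlast"

# '.' matches CR, so the trailing CR ends up in the match.
$null   = $crlf -match '.*BOURKE.*'
$withCr = $Matches[0][-1] -eq [char]13      # True

# Excluding CR and LF explicitly stops the match before the CR.
$null      = $crlf -match '.*BOURKE[^\r\n]*'
$cleanLine = $Matches[0]                    # 001 BOURKE, Bridget
```

A stray trailing CR is easy to miss on screen, but it can break later string comparisons.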

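Two asides:

PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value
GitHub issue #7867 proposes bringing this functionality directly to PowerShell in the form of a -matchall operator.
If you want to anchor the substring to find, i.e. if you want to stipulate that it occur either at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to match only if BOURKE occurs at the very start of a line:
if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }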
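Both asides can be tried out with a short sketch; the sample strings are hypothetical and chosen so that more than one line qualifies:

```powershell
# Hypothetical input in which two lines contain the name, in different casing.
$multi = "001 BOURKE, Bridget`r`nsomething else`r`n002 Bourke, David"

# [regex]::Matches() returns every match; .Value collects the matched strings.
$found = @([regex]::Matches($multi, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value)
$found    # 001 BOURKE, Bridget / 002 Bourke, David

# (?m) makes ^ match at the start of every line, not just at the start of the string.
$anchored = "x BOURKE a`nBOURKE b"
if ($anchored -match '(?m)^BOURKE[^\r\n]*') {
    $atLineStart = $Matches[0]    # BOURKE b
}
```

Unlike the -match operator, [regex]::Matches() is case-sensitive unless the IgnoreCase option is passed, which is why the option appears in the call.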

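If line-by-line processing is an option:

Line-by-line processing has the advantage that you need not worry about differences in newline formats (assuming that the utility that splits the input into lines can handle both newline formats, which is generally true of PowerShell).
If you are reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads the file line by line, without the need to read the whole file into memory.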

(Select-String -LiteralPath file.txt -Pattern BOURKE).Line

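Add -CaseSensitive for case-sensitive matching.

The following example simulates the above (-split '\r?\n' splits the multi-line input string into individual lines, recognizing either newline format):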
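For a file-based trial run, here is a minimal sketch that assumes a throwaway temporary file with invented contents (New-TemporaryFile requires PowerShell 5 or later):

```powershell
# Create a hypothetical input file; the contents are made up for illustration.
$tmp = New-TemporaryFile
Set-Content -LiteralPath $tmp.FullName -Value 'first line', 'bourke in lowercase', '001 BOURKE, Bridget Mary', 'last line'

# -SimpleMatch treats the pattern as a literal substring; -CaseSensitive skips the lowercase line.
$lines = (Select-String -LiteralPath $tmp.FullName -Pattern 'BOURKE' -SimpleMatch -CaseSensitive).Line
$lines    # 001 BOURKE, Bridget Mary

# Clean up the temporary file.
Remove-Item -LiteralPath $tmp.FullName
```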

(
@'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
  Select-String -Pattern BOURKE
).Line

Output:

001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...

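[1] Strictly speaking, [^\r\n]* also stops matching at a \r character in isolation (i.e., even if it is not directly followed by \n). If ruling out this case is important (which seems unlikely), use (a simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)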
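A small sketch of how the lookahead variant behaves, using made-up input strings; note that, as written, this pattern needs a newline after the line in question, so it would not match a final line that lacks a trailing newline:

```powershell
$pattern = '.*BOURKE.*?(?=\r?\n)'

# Hypothetical CRLF input: the match stops before the CRLF pair.
$null = "before`r`n001 BOURKE x`r`nafter" -match $pattern
$crlfMatch = $Matches[0]                     # 001 BOURKE x

# Hypothetical input with an isolated CR inside the line: the CR is kept,
# and the match only ends at the real newline.
$null = "001 BOURKE x`ry`nz" -match $pattern
$loneCrKept = $Matches[0].Contains("`r")     # True
```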

