Get line number of Text.RegularExpressions.Regex matches

Question

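I use PowerShell to parse a directory of log files and extract all XML entries from them. This works fine. However, since one log file can contain many of these XML fragments, I also want to put the line number of each match into the file name of the XML file I write, so that I can open the log file and jump to that specific line for root cause analysis.

There is a field "Index", which I think is a character count and which could lead me to the line number. But I think "Index" somehow includes more than Measure-Object -Character does, because the value of Index is larger than the size reported by Measure-Object -Character. For example, $m.Groups[0].Captures[0].Index is 9963166, but Measure-Object -Character reports at most 9838833 for any whole file in the log directory, so I think Index counts newlines as well.

So the question is probably: if a match gives me "Index" as a property, how do I find out how many newlines lie before that index? Do I have to take the first "Index" characters from the file and count the newlines in them to get the line? Probably.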

$tag = 'data_json'
$input_dir = $absolute_root_dir + $specific_dir
$output_dir = $input_dir  + 'ParsedDataFiles\'
$OFS = "`r`n"
$nice_specific_dir = $specific_dir.Replace('\','_')
$nice_specific_dir = $nice_specific_dir.Replace(':','_')
$regex = New-Object Text.RegularExpressions.Regex "<$tag>(.+?)<\/$tag>", ('singleline', 'multiline')
New-Item -ItemType Directory -Force -Path $output_dir
Get-ChildItem -Path $input_dir -Name -File | % {   
    $output_file = $output_dir + $nice_specific_dir + $_ + '.'
    $content = Get-Content ($input_dir + $_)
    $i = 0
    foreach($m in $regex.Matches($content)) {        
        $outputfile_xml = $output_file + $i++ + '.xml'
        $outputfile_txt = $output_file + $i++ + '.txt'
        $xml = [xml] ("<" + $tag+ ">" + $m.Groups[1].Value + "</" + $tag + ">")
        $xml.Save($outputfile_xml)
        $j = 0
        $xml.data_json.Messages.source.item | % { $_.SortOrder + ", " + $_.StartOn + ", " + $_.EndOn + ", " + $_.Id } | sort | %  { 
            (($j++).ToString() + ", " + $_ )   | Out-File $outputfile_txt -Append
        }
    }
}

Answer 1

I won't pretend to know exactly what you're trying to do, but regular expressions can't count. That means you have to do it a different way.

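Way 1:

You can create two groups. Group 1 matches everything up to the thing you want to match, which is group 2: (?s)(.*?)(match_me)
Then, when you get a match, make a string from group 1 and run a regex over it that
matches line breaks: \r?\n. Increment a counter for each of those matches;
starting the counter at 1 gives the line number of the original match_me thing.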
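
In PowerShell, Way 1 might look like the following minimal sketch. The sample text and the search string match_me are placeholders, not anything taken from the question; 1 is added to the newline count so that the result is a 1-based line number.

```powershell
# Sample input (an assumption for illustration): 'match_me' sits on line 3.
$text = "first line`nsecond line`n  match_me here`nlast line"

# Group 1 lazily captures everything before the target; group 2 is the target.
# (?s) lets '.' match newlines, so group 1 can span several lines.
$m = [regex]::Match($text, '(?s)(.*?)(match_me)')

if ($m.Success) {
  # Count the line breaks in the prefix; add 1 for a 1-based line number.
  $lineNumber = 1 + [regex]::Matches($m.Groups[1].Value, '\r?\n').Count
  "Found '$($m.Groups[2].Value)' at line $lineNumber"
}
```

With this sample input, the prefix contains two line breaks, so the reported line is 3.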

Way 2:

If PowerShell uses the .NET regex engine =>
use this regex with code like the following (adapt it for PowerShell; this is C#):

int line_num = 0;
var Rx = new Regex(@"^(?:.*((?:\r?\n)?))*?(match_me)");
Match M = Rx.Match(str);
if (M.Success) {
   CaptureCollection ccLineBreaks = M.Groups[1].Captures;
   line_num = ccLineBreaks.Count;
}

The Count of group 1's capture collection is the number of individual line breaks
up to the match_me thing.
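
A PowerShell adaptation could look like the sketch below. Note that it does not use the regex from the C# snippet; it uses the adapted regex quoted in Answer 2, (?:.*(\r?\n))*?.*?(match_me), which captures only actual line breaks. The sample text is made up, and 1 is added to the capture count to get a 1-based line number.

```powershell
# Sample input (an assumption for illustration): 'match_me' sits on line 3.
$text = "first line`nsecond line`n  match_me here`nlast line"

# Each repetition of the non-capturing group consumes one full line and
# captures its line break in group 1; '.*?' then skips ahead to the target.
$rx = [regex] '(?:.*(\r?\n))*?.*?(match_me)'
$m = $rx.Match($text)

$lineNum = 0
if ($m.Success) {
  # Group 1's capture collection holds one entry per line break consumed.
  $lineNum = 1 + $m.Groups[1].Captures.Count
}
"match_me is on line $lineNum"
```

Here two line breaks are captured before the match, so the script reports line 3.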

Answer 2

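Note: If what your regex matches is guaranteed never to span multiple lines, i.e. if the matched text is guaranteed to be on a single line, consider a simpler Select-String-based solution, as shown in Answer 3; in general, though, a method/expression-based solution such as the one in this answer will perform better.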

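Your first problem is that you use Get-Content without -Raw, which reads the input file as an array of lines rather than as a single, multi-line string.

When you pass this array to $regex.Matches(), PowerShell stringifies the array by joining its elements with spaces (by default).

Therefore, read your input file with Get-Content -Raw, which ensures that it is read as a single, multi-line string with the newlines intact: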

# Read entire file as single string
$content = Get-Content -Raw ($input_dir + $_)
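
To see the difference between the two read modes, here is a small sketch that writes a throwaway temp file (its content is made up for illustration) and reads it back both ways:

```powershell
# Create a throwaway 3-line file; the content is an assumption for illustration.
$path = [IO.Path]::GetTempFileName()
Set-Content -Path $path -Value 'one', 'two', 'three'

$lines = Get-Content $path        # array of strings, one per line, newlines stripped
$raw   = Get-Content -Raw $path   # one string, newlines preserved

"Without -Raw: $($lines.GetType().Name) with $($lines.Count) elements"
"With -Raw:    $($raw.GetType().Name) of length $($raw.Length)"

Remove-Item $path
```

Only the -Raw result still contains the newline characters that a line count depends on.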

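Once you match against a multi-line string, you can infer the line number of each match by counting the lines in the substring that extends up to the character index at which the match was found, via .Substring() and Measure-Object -Line.

Here is a simplified, self-contained example (see the section further down if you also want to determine the column number):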

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements.
# Inline option ('(?...)') 's' makes '.' match newlines too
$regex = [regex] '(?s)<title>.+?</title>'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  "Found '$($m.Value)' at index $($m.Index), line $lineNumber"
}

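Note the + 1 in $m.Index + 1, which is needed to ensure that the substring doesn't end in a newline, because Measure-Object -Line ignores such a trailing newline. By including at least one additional (non-newline) character, namely the < of the matched element, the line count is always correct, even if the matched element starts in the very first column.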

The above yields:

Found '<title>De Profundis</title>' at index 34, line 3
Found '<title>Pygmalion</title>' at index 96, line 6
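
If you need this in several places, the calculation can be wrapped in a small function. The sketch below is only an illustration: the function name Get-LineNumberAtIndex and the sample string are made up, and the length is clamped so that an index at the very end of the string doesn't make .Substring() throw.

```powershell
# Hypothetical helper: 1-based line number of a character index in a string.
function Get-LineNumberAtIndex {
  param(
    [Parameter(Mandatory)] [string] $Text,
    [Parameter(Mandatory)] [int] $Index
  )
  # Include the character at $Index so that the substring never ends in a
  # newline; clamp the length so that an index at the very end doesn't throw.
  $length = [Math]::Min($Index + 1, $Text.Length)
  ($Text.Substring(0, $length) | Measure-Object -Line).Lines
}

# Why the extra character matters: a trailing newline doesn't start a new line.
("a`n"  | Measure-Object -Line).Lines   # counts 1 line
("a`nb" | Measure-Object -Line).Lines   # counts 2 lines

$sample = "alpha`nbeta`ngamma"
foreach ($m in [regex]::Matches($sample, 'beta|gamma')) {
  "'$($m.Value)' is on line $(Get-LineNumberAtIndex -Text $sample -Index $m.Index)"
}
```

For the sample string, this reports line 2 for beta and line 3 for gamma.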

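If you also want to get the column number (the 1-based index of the character at which the match starts, within the line on which it was found):

Determining the line and column numbers of regex matches in a multi-line string: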

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, columns 5 and 7, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements, along with the
# string that precedes them on the same line:
# Due to use of capture groups, each match $m will contain:
#  * the matched element: $m.Groups[2].Value
#  * the preceding string on the same line: $m.Groups[1].Value
# Inline options ('(?...)'):
#   * 's' makes '.' match newlines too
#   * 'm' makes '^' and '$' match the starts and ends of *individual lines*
$regex = [regex] '(?sm)(^[^\n]*)(<title>.+?</title>)'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  $columnNumber = 1 + $m.Groups[1].Value.Length
  "Found '$($m.Groups[2].Value)' at line $lineNumber, column $columnNumber."
}

The above yields:

Found '<title>De Profundis</title>' at line 3, column 5.
Found '<title>Pygmalion</title>' at line 6, column 7.
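
As a variation that is not part of the solution above, the column can also be derived without changing the regex, by locating the newline that precedes the match with String.LastIndexOf(). This is a sketch under the assumption that lines end in \n (optionally preceded by \r); the sample input is shortened for illustration.

```powershell
# Shortened sample input (an assumption): the <title> element is on line 3, column 7.
$content = "<catalog>`n  <book>`n      <title>Pygmalion</title>`n  </book>`n</catalog>"
$regex = [regex] '(?s)<title>.+?</title>'
$newline = [char] "`n"   # the char overload of LastIndexOf() searches ordinally

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  # Index of the newline that ends the previous line; -1 on the first line,
  # in which case the subtraction below yields $m.Index + 1.
  $prevNewline = $content.LastIndexOf($newline, $m.Index)
  $columnNumber = $m.Index - $prevNewline
  "Found '$($m.Value)' at line $lineNumber, column $columnNumber."
}
```

The trade-off is one extra backward scan per match in exchange for keeping the simpler, single-purpose regex.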

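Note: For simplicity, both solutions above count the lines from the start of the string in every iteration.
In most cases this will probably still perform well enough; if it doesn't, see the variant approach in the performance benchmarks below, where the line count is calculated iteratively, with only the lines between the current and the previous match being counted in a given iteration.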

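Optional reading: performance comparison of line-counting approaches:

sln's answer suggests using regular expressions for the line counting as well.

It may be of interest to compare these approaches, along with the .Substring() plus Measure-Object -Line approach above, in terms of performance.

The following tests are based on the Time-Command function.

The sample results are from PowerShell Core 7.0.0-preview.3 on macOS 10.14.6, averaged over 100 runs; the absolute numbers will vary with the execution environment, but the relative ranking of the approaches (Factor column) seems to be similar across platforms and PowerShell versions:

With 1,000 lines and 1 match on the last line: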

Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00   0.001               # .Substring() + Measure-Object -Line, count iteratively…
1.07   0.001               # .Substring() + Measure-Object -Line, count from start…
2.22   0.002               # Repeating capture with nested newline capturing…
6.12   0.006               # Prefix capture group + Measure-Object -Line…
6.72   0.007               # Prefix capture group + newline-matching regex…
7.24   0.007               # Prefix Capture group + -split…

With 20,000 lines and 20 evenly spaced matches, starting at line 1,000:

Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00   0.014               # .Substring() + Measure-Object -Line, count iteratively…
2.92   0.042               # Repeating capture with nested newline capturing…
7.50   0.107               # .Substring() + Measure-Object -Line, count from start…
8.39   0.119               # Prefix capture group + Measure-Object -Line…
9.50   0.135               # Prefix capture group + newline-matching regex…
9.94   0.141               # Prefix Capture group + -split…

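Notes and conclusions:

- Prefix capture group refers to (variations of) "Way 1" from sln's answer, whereas Repeating capture with nested newline capturing refers to "Way 2".
  - Note: For Way 2, an (adapted) regex, (?:.*(\r?\n))*?.*?(match_me), is used below. It is a greatly improved version that sln added later in a comment; the version still shown in the body of that answer (as of this writing), ^(?:.*((?:\r?\n)?))*?(match_me), doesn't work for processing multiple matches in a loop.
- The .Substring() + Measure-Object -Line approach from this answer is the fastest in all cases. With many matches to loop over, however, that holds only if the line counting is performed iteratively, between matches (.Substring() + Measure-Object -Line, count iteratively…), whereas the solutions above, for simplicity, count the lines from the start for every match (.Substring() + Measure-Object -Line, count from start…).
- With the Way 1 approach (Prefix capture group), the specific method used to count the newlines in the prefix match makes relatively little difference, though Measure-Object -Line is the fastest there too.

Here is the source code for the tests; by modifying the various variables near the bottom, it is easy to experiment with the match count, the total number of input lines, and so on: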

# The script blocks with the various approaches.
$sbs =
  { # .Substring() + Measure-Object -Line, count from start
    foreach ($m in [regex]::Matches($txt, 'found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo = ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
      "Found at line $lineNo (substring() + Measure-Object -Line, counted from start every time)."
    }
  },
  { # .Substring() + Measure-Object -Line, count iteratively
    $lineNo = 0; $startNdx = 0
    foreach ($m in [regex]::Matches($txt, 'found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo += ($txt.Substring($startNdx, $m.Index + 1 - $startNdx) | Measure-Object -Line).Lines
      "Found at line $lineNo (substring() + Measure-Object -Line, counted iteratively)."
      $startNdx = $m.Index + $m.Value.Length
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix capture group + Measure-Object -Line
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo += ($m.Groups[1].Value + '.' | Measure-Object -Line).Lines
      "Found at line $lineNo (prefix capture group + Measure-Object -Line)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix capture group + newline-matching regex
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      $lineNo += 1 + [regex]::Matches($m.Groups[1].Value, '\r?\n').Count
      "Found at line $lineNo (prefix capture group + newline-matching regex)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix Capture group + -split
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      $lineNo += ($m.Groups[1].Value -split '\r?\n').Count
      "Found at line $lineNo (prefix capture group + -split for counting)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Repeating capture with nested newline capturing
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?:.*(\r?\n))*?.*?found')) {
      $lineNo += 1 + $m.Groups[1].Captures.Count
      "Found at line $lineNo (repeating prefix capture group with newline capture)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  }

# Set this to 1 for debugging:
#   * runs the script blocks only once
#   * with 3 matching strings in the input.
#   * shows output so that the expected functionality (number of matches, line numbers) can be verified.
$debug = 0

$matchCount = if ($debug) { 3 } else {
  20 # Set how many matching strings should be present in the input string.
}

# Sample input:
# Create N lines that are 60 chars. wide, with the string to find on the last line...
$n = 1e3 # Set the number of lines per match.
$txt = ((1..($n-1)).foreach('ToString', '0' * 60) -join "`n") + "`n  found`n"
# ...and multiply the original string according to how many matches should be present.
$txt = $txt * $matchCount

$runsToAverage = if ($debug) { 1 } else {
  100   # Set how many test runs to report average timing for.
}
$showOutput = [bool] $debug

# Run the tests.
Time-Command -Count $runsToAverage -OutputToHost:$showOutput $sbs
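
Time-Command is not built into PowerShell and has to be obtained separately. If it isn't available, a rough stand-in can be built on the built-in Measure-Command cmdlet. The sketch below is an assumption-laden substitute, not the function used for the results above: the name Measure-AverageSeconds, its parameters, and the sample script blocks are all made up, and it reports only average seconds, without a Factor column.

```powershell
# Hypothetical stand-in for Time-Command: average wall-clock seconds per run.
function Measure-AverageSeconds {
  param(
    [Parameter(Mandatory)] [scriptblock[]] $ScriptBlock,
    [int] $Count = 10
  )
  $results = foreach ($sb in $ScriptBlock) {
    # Measure-Command discards the script block's output and returns a TimeSpan.
    $elapsed = Measure-Command { foreach ($i in 1..$Count) { & $sb } }
    [pscustomobject] @{
      Secs    = $elapsed.TotalSeconds / $Count
      Command = $sb.ToString().Trim()
    }
  }
  # Fastest first.
  $results | Sort-Object Secs
}

# Usage with two throwaway script blocks (made up for illustration):
$txt = ('x' * 60 + "`n") * 1000 + "  found`n"
Measure-AverageSeconds -Count 5 -ScriptBlock { [regex]::Matches($txt, 'found').Count },
                                             { ($txt | Measure-Object -Line).Lines }
```

To time the approaches above, pass $sbs as the -ScriptBlock argument instead; the absolute numbers will differ from those of Time-Command.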

Answer 3

Select-String will tell you the line number.

dir file | select-string two | fl

Output

IgnoreCase : True
LineNumber : 2
Line       : two
Filename   : file
Path       : /Users/js/foo/file
Pattern    : two
Context    : 
Matches    : {0}
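
To use the line number in a script rather than just display it, read the LineNumber property of the objects that Select-String returns. The following sketch uses a throwaway temp file with made-up content; keep in mind that Select-String matches line by line, so this works only for matches that don't span lines.

```powershell
# Throwaway 3-line file; the content is an assumption for illustration.
$path = [IO.Path]::GetTempFileName()
Set-Content -Path $path -Value 'one', 'line two here', 'three'

# Each result is a MatchInfo object; Matches holds the regex matches in that line.
$hits = Select-String -Path $path -Pattern 'two'
foreach ($hit in $hits) {
  $column = $hit.Matches[0].Index + 1   # Index is 0-based within the line
  "Line $($hit.LineNumber), column ${column}: $($hit.Line)"
}

Remove-Item $path
```

For this file, the single hit is on line 2, column 6.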
