How do I put regex match groups into separate output columns, correctly handling missing/empty values?

Question

Suppose I have the following file:

This file has two lines
This file has three lines
This file has four
This file has five lines

I want to grep for file and lines so that I get the following output:

file lines
file lines
file
file lines

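If both matches are found on a line, print them on the same output line. If only one is found, print it, leave a placeholder (null/blank/whatever), and move on to the next line.

I tried this: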

grep -oP '(file)|(lines)' example.txt | paste -d ' ' - -

But I get:

file lines
file lines
file file
lines

Because lines is not found on the third line, paste takes the file match from the next line and puts it on the same output line.

In effect I am forcing paste to fill every slot in the output, regardless of what was found on each input line.
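
The behaviour described here can be reproduced with a short sketch. It assumes a grep that supports -o together with -n (GNU and BSD grep do), and it writes the sample input to a temporary file from mktemp, which is not part of the original setup. The first listing shows that the match stream carries no record of input line boundaries; the second prefixes each match with its input line number, which makes the missing lines match on line 3 visible.

```shell
#!/usr/bin/env bash
# Recreate the sample input in a temporary file (path chosen by mktemp, for illustration only).
input=$(mktemp)
printf '%s\n' \
  'This file has two lines' \
  'This file has three lines' \
  'This file has four' \
  'This file has five lines' >"$input"

# grep -o prints every match on its own output line, so paste sees a flat
# stream of seven words and cannot tell which input line each came from.
matches=$(grep -oE 'file|lines' "$input")
printf '%s\n' "$matches"

# With -n, each match is prefixed by its input line number; line 3 appears once.
numbered=$(grep -onE 'file|lines' "$input")
printf '%s\n' "$numbered"

rm -f "$input"
```

Because paste only pairs consecutive lines of its input, any fix has to keep the per-line grouping until after the columns are assembled.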

How can I change this?

Answer 1

I'm assuming that file and lines are really regular expressions with their own match groups. The following lets you drop in any ERE in their place:

#!/usr/bin/env bash

# replace these with any ERE-compliant regex of your choice
file_re='(file)'    # for instance: file_re='file=([^[:space:]]+)([[:space:]]|$)'
lines_re='(lines)'

while IFS= read -r line; do
  # default to a blank placeholder if no matches exist
  file= lines=

  # compare against each regex; if one matches, assign the group contents to a variable
  [[ $line =~ $file_re ]] && file=${BASH_REMATCH[1]}
  [[ $line =~ $lines_re ]] && lines=${BASH_REMATCH[1]}

  # print a line of output if *either* regex matched.
  [[ $file || $lines ]] && printf '%s\t%s\n' "$file" "$lines"

done <"${1:-example.txt}" # with input from $1 if given, or example.txt otherwise

See BashFAQ #1 ("How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?") for an explanation of the techniques used here.

With your given input, the output is:

file    lines
file    lines
file
file    lines
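
If a visible placeholder is preferred over an empty column, the same loop can be wrapped in a function that substitutes a marker for each missing match. This is only a sketch: extract_columns is a hypothetical helper name, "-" is an arbitrary default placeholder, and the function reads standard input rather than a file argument.

```shell
#!/usr/bin/env bash
# extract_columns [PLACEHOLDER]  (hypothetical helper; PLACEHOLDER defaults to "-")
extract_columns() {
  local placeholder=${1:--} line file lines
  local file_re='(file)' lines_re='(lines)'

  # "|| [[ -n $line ]]" also handles a final line that lacks a trailing newline
  while IFS= read -r line || [[ -n $line ]]; do
    file= lines=
    [[ $line =~ $file_re ]] && file=${BASH_REMATCH[1]}
    [[ $line =~ $lines_re ]] && lines=${BASH_REMATCH[1]}

    # skip input lines where neither regex matched
    [[ $file || $lines ]] || continue

    # substitute the placeholder for whichever column is empty
    printf '%s\t%s\n' "${file:-$placeholder}" "${lines:-$placeholder}"
  done
}

result=$(printf '%s\n' 'This file has two lines' 'This file has four' 'no match here' | extract_columns)
printf '%s\n' "$result"
```

A non-empty placeholder keeps the column count stable for downstream tools that split on runs of whitespace instead of on single tabs.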

Answer 2

sed is for s/old/new/ and grep is for g/re/p. For any other text manipulation you should be using awk.

With GNU awk, using the third argument to match():

$ awk '{f=match($0,/file/,a); f+=match($0,/lines/,b)} f{print a[0], b[0]}' file
file lines
file lines
file
file lines

With other awks you'd use substr() to capture the matched strings:

$ awk '{f=match($0,/file/); a=substr($0,RSTART,RLENGTH); f+=match($0,/lines/); b=substr($0,RSTART,RLENGTH)} f{print a, b}' file
file lines
file lines
file
file lines
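
The repeated match()/substr() pair can also be folded into a small user-defined function, which keeps the per-column logic in one place as more patterns are added. A minimal sketch, assuming a POSIX awk with user-defined functions; grab is a hypothetical function name, the patterns are passed as dynamic regex strings, and the columns are tab-separated here (a choice of this sketch, not of the one-liners above):

```shell
#!/usr/bin/env bash
# grab(re) returns the first substring of the current record matching re, or "" if none.
result=$(printf '%s\n' \
  'This file has two lines' \
  'This file has four' \
  'This file has five lines' |
awk '
function grab(re) { return match($0, re) ? substr($0, RSTART, RLENGTH) : "" }
{ a = grab("file"); b = grab("lines") }
# print only when at least one pattern matched; an empty column stays empty
a != "" || b != "" { printf "%s\t%s\n", a, b }
')
printf '%s\n' "$result"
```

With a tab separator the empty second column is still delimited, so a reader of the output can tell a missing lines match apart from a missing file match.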
