RS in awk language

I'm learning the awk programming language and I've run into a problem.

I have a file (awk.dat) with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.

I'm using the following command:

awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat

It returns the error:

awk: run time error: regular expression compile failed (missing operand)
*, *
    FILENAME="" FNR=0 NR=0

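Meanwhile, if I use the command awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat, it gives me the desired result.

I need to understand this part: RS=" *, *". What do the space between the double quotes and the * before the , mean, and why does the command throw an error without that space?

Expected output: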

Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---

Thanks.

"[space1]*,[space2]*"

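is a regular expression that matches the string:

zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2)

The first one, "*,[space]*", is wrong because * has a special meaning in regular expressions: it repeats the preceding group/character zero or more times. With nothing before it to repeat, it cannot be placed at the very beginning.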
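To see what the pattern matches without depending on a regex-valued RS, the same expression can be handed to split(), which accepts a regular expression in any POSIX awk. This is only a sketch: the helper name show_parts and the sample input are made up for illustration.

```shell
# show_parts is a hypothetical helper: it splits each input line on the
# regex / *, */ (optional spaces, a comma, optional spaces) and prints
# every piece in brackets so that any leftover spaces would be visible.
show_parts() {
    awk '{
        n = split($0, parts, / *, */)
        for (i = 1; i <= n; i++)
            printf "[%s]\n", parts[i]
    }'
}

printf 'a ,b,   c\n' | show_parts
```

This prints [a], [b] and [c] on separate lines: the spaces on either side of each comma are consumed by the separator.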

Note that, according to POSIX, RS is defined as a single character, not a regular expression.

The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

source: Awk Posix standard

This means that, under POSIX, RS=" *, *" gives unspecified results.
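If portability matters, one option is to stay within what POSIX guarantees: use a single-character RS and trim the spaces around each record by hand. The following is a minimal sketch under that assumption; split_records is a hypothetical name, not something from the question.

```shell
# split_records is a hypothetical helper. RS is a single character, which is
# all POSIX guarantees; the spaces around each comma are trimmed explicitly.
split_records() {
    awk 'BEGIN { RS = ","; ORS = "<<<---\n" }
         { sub(/^ +/, ""); sub(/ +$/, ""); print }' "$@"
}

printf 'a , b,c' | split_records
```

It reads standard input when called without arguments, or the files named as arguments, e.g. split_records awk.dat.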

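Other versions of awk implement extensions to POSIX and may treat RS differently. Examples are GNU awk and mawk: both implement RS as a regular expression, but their implementations differ slightly. In summary: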

| RS   | awk (posix)  | gawk             | mawk             |
|------+--------------+------------------+------------------|
| "*"  | "<asterisk>" | "<asterisk>"     | "<asterisk>"     |
| "*c" | undefined    | "<asterisk>c"    | undefined        |
| "c*" | undefined    | "","c","ccc",... | "","c","ccc",... |

c is any character

The above should explain the OP's error: mawk considers RS="*, *" an invalid regular expression.

$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)

GNU awk: The GNU awk manual states the following:

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.

source: GNU awk manual

To understand how * is used in GNU awk regular expressions, we find:

<asterisk> * This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.

There are two subtle points to understand how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.

Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.

source: GNU Regular expression operators

It must also be mentioned that:

In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.

source: GNU Regular expression operators

So in GNU awk, setting RS="*, *" means it matches the strings "*,", "*, ", "*,  ", ...

$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c
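If a literal asterisk really is what you want, escaping it avoids relying on how a particular awk treats a leading *. A small sketch using gsub() rather than RS, so that it does not depend on regex support in RS; literal_star is a hypothetical name.

```shell
# literal_star is a hypothetical helper: \* is an escaped, literal asterisk,
# so the pattern means "an asterisk, a comma, then optional spaces".
literal_star() {
    awk '{ gsub(/\*, */, "|") } 1'
}

printf 'a*,b, c\n' | literal_star
```

This prints a|b, c: only the "*," sequence is replaced, and the plain ", " is left untouched.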

mawk: The mawk manual states the following:

12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.

source: man mawk

However:

11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.

Split(expr,A,sep) works as follows:

  1. <snip>
  2. If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
  3. <snip>

source: man mawk
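The last point of step 2 is easy to check: a separator string of length 1 is taken literally, even when it is a metacharacter. POSIX describes the same rule for a single-character FS, so this sketch should behave the same in other awks too; count_star_fields is a hypothetical name.

```shell
# count_star_fields is a hypothetical helper: the one-character separator "*"
# is treated as a literal asterisk, not as a regex operator.
count_star_fields() {
    awk '{ n = split($0, A, "*"); print n ": " A[1] " " A[2] " " A[3] }'
}

printf 'a*b*c\n' | count_star_fields
```

This prints 3: a b c.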

The manual does not say how a regular expression that begins with a metacharacter (e.g. "*c") should be interpreted.


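Note: in the GNU awk section I crossed out "POSIX awk", because according to POSIX a regular expression that begins with one of these operators leads to undefined behaviour. (This has no bearing on how RS is defined, since RS is not an ERE in POSIX awk anyway.)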

The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)

source: Awk Posix standard

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

  • If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
  • If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

source: POSIX Extended Regular Expressions

Could you please try the following:

awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}' Input_file