RS in the awk language
I am learning the awk programming language and have run into a problem.
I have a file (awk.dat) with the following content:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.
I am using the following command:
awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat
It returns an error:
awk: run time error: regular expression compile failed (missing operand)
*, *
FILENAME="" FNR=0 NR=0
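Meanwhile, if I use the command:
awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat
it gives me the desired result.
I would like an explanation of this part: RS=" *, *". What do the space after the opening double quote and the * before the , mean, and why does leaving the space out throw an error?
Expected output: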
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---
Thanks.
"[space1]*,[space2]*"
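is a regular expression that matches the following string:
zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2).
The first one, "*,[space]*", is wrong because * has a special meaning in regular expressions: it repeats the preceding group or character zero or more times. At the very beginning of the expression there is nothing for it to repeat, so it cannot be placed there.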
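To see what the pattern " *, *" matches without relying on a regex-capable RS, the same ERE can be handed to gsub(), which every awk provides. This is only a sketch: the function name mark_separators and the sample input are made up for illustration.

```shell
# mark_separators is a hypothetical helper, not part of the question.
# It replaces every match of the ERE " *, *" (optional spaces, a comma,
# optional spaces) with a visible "|" marker.
mark_separators() {
    awk '{ gsub(/ *, */, "|"); print }'
}

printf 'a , b,c,  d\n' | mark_separators
# prints: a|b|c|d
```

Each comma disappears together with the spaces around it, which is the text that an awk treating RS as a regular expression would consume as the record separator.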
Note that according to POSIX, RS is defined as a single character, not a regular expression.
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
source: Awk Posix standard
This means that, under POSIX, the result of RS=" *, *" is unspecified.
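If portability matters, one way to stay within what POSIX specifies is to keep RS a single character and trim the spaces inside the action. A minimal sketch, assuming the separator is always a comma optionally followed by spaces; split_on_comma is a hypothetical name and the input is a made-up sample:

```shell
# split_on_comma is a hypothetical helper for illustration.
# RS is a single character (","), which POSIX defines; the leading spaces
# left at the start of each record are removed with sub().
split_on_comma() {
    awk 'BEGIN { RS = ","; ORS = "<<<---\n" }
         { sub(/^ */, ""); print }'
}

printf 'a, b,c\n' | split_on_comma
# prints:
# a<<<---
# b<<<---
# c
# <<<---
```

The final newline of the input stays inside the last record, which is why the last marker ends up on a line of its own, as in the expected output of the question. Spaces before a comma are not trimmed in this sketch.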
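Other versions of awk, which implement extensions to POSIX, may treat RS differently. Examples are GNU awk and mawk. Both implement RS as a regular expression, but the two implementations differ slightly. In summary: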
| RS | awk (posix) | gawk | mawk |
|------+--------------+------------------+------------------|
| "*" | "<asterisk>" | "<asterisk>" | "<asterisk>" |
| "*c" | undefined | "<asterisk>c" | undefined |
| "c*" | undefined | "","c","ccc",... | "","c","ccc",... |
c is any character
The above should explain the OP's error: according to mawk, RS="*, *" is an invalid regular expression.
$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)
GNU awk: The GNU awk manual states the following:
When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.
source: GNU awk manual
To understand how * is used in a GNU awk regular expression, we find:
<asterisk> *
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.
There are two subtle points to understand how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.
Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.
source: GNU Regular expression operators
Something that must be mentioned is:
In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
source: GNU Regular expression operators
Therefore, in GNU awk, setting RS="*, *" means it will match the strings "*,", "*, ", "*,  ", ...
$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c
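If a literal asterisk really is part of the separator, putting it inside a bracket expression sidesteps the question of how a leading * is treated, because * is not special there. The sketch below uses gsub() so that it does not depend on a regex-capable RS; split_on_star_comma is a hypothetical name, and the claim that RS="[*], *" would behave the same way in gawk or mawk is an assumption that is not tested here.

```shell
# split_on_star_comma is a hypothetical helper for illustration.
# "[*]" is a bracket expression holding a literal asterisk, so the pattern
# matches "*," followed by zero or more spaces.
split_on_star_comma() {
    awk '{ gsub(/[*], */, "\n"); print }'
}

printf 'a*,b, c\n' | split_on_star_comma
# prints:
# a
# b, c
```

The output matches the second gawk run above: only the comma preceded by an asterisk is treated as a separator.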
mawk: The mawk manual states the following:
12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.
source: man mawk
However:
11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.
Split(expr,A,sep) works as follows:
- <snip>
- If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
- <snip>
source: man mawk
The manual does not mention how a regular expression that starts with a meta-character (e.g. "*c") should be interpreted.
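The one-character case that the manual does describe is easy to check. The following sketch splits a string on a separator consisting of a single asterisk, which is taken literally rather than as a quantifier; count_star_fields and the sample input are hypothetical.

```shell
# count_star_fields is a hypothetical helper for illustration.
# A one-character separator such as "*" is used literally by split(),
# so the input is cut at every asterisk.
count_star_fields() {
    awk '{ n = split($0, parts, "*"); print n, parts[1], parts[2], parts[3] }'
}

printf 'a*b*c\n' | count_star_fields
# prints: 3 a b c
```

This corresponds to the first row of the summary table, where a lone "*" stands for a literal asterisk in every implementation listed.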
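Note: In the GNU awk section, I struck out "POSIX awk", because according to POSIX a regular expression that begins with * (such as "*c") leads to undefined behaviour. (This is independent of the definition of RS, since RS is not an ERE in POSIX awk anyway.)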
The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)
source: Awk Posix standard
and
*+?{
The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:
- If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
- If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)
Could you please try the following.
awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}' Input_file