Why does the POSIX mode of GNU Awk not treat a newline as a field when RS is set to something else?
I was going through the GNU Awk User's Guide and found this in section 4.1.1, Record Splitting with Standard awk:
When using regular characters as the record separator, there is one unusual case that occurs when gawk is being fully POSIX-compliant (see section Command-Line Options). Then, the following (extreme) pipeline prints a surprising ‘1’:
$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
-| 1
There is one field, consisting of a newline. The value of the built-in variable NF is the number of fields in the current record. (In the normal case, gawk treats the newline as whitespace, printing ‘0’ as the result. Most other versions of awk also act this way.)
I checked this, but it does not work on my GNU Awk 5.0.0:
$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
0
That is, the behavior is exactly the same as without POSIX mode:
$ echo | gawk 'BEGIN { RS = "a" } ; { print NF }'
0
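I understand the point it makes: when the record separator is not the default (that is, not a newline), input consisting of only a newline is treated as one field. However, I cannot reproduce it.
How can I reproduce this example? I also tried gawk --traditional and gawk -P, but the result is always 0.
Since the GNU Awk User's Guide I checked is for version 5.1 and I have 5.0.0, I also checked an archived version for 5.0.0. It shows the same lines, so nothing changed between 5.0 and 5.1.
Reading the POSIX standard, we find: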
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non-<blank> non-<newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable
If FS is <space>, skip leading and trailing <blank> and <newline> characters; fields shall be delimited by sets of one or more <blank> or <newline> characters.
That said, the correct behavior should be the following:
$ echo | awk 'BEGIN{RS="a"}{print NR,NF,length}'
1 0 1
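- Single record: no a character was encountered.
- No fields: FS is the default <space>, so all leading and trailing <blank> and <newline> characters are skipped.
- Length one: the record contains only a single character.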
When FS is defined, the situation is completely different:
$ echo | awk 'BEGIN{FS="b";RS="a"}{print NR,NF,length}'
1 1 1
$ echo | awk 'BEGIN{FS="\n";RS="a"}{print NR,NF,length}'
1 2 1
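The three cases can also be run side by side. This is only a sketch for comparison: the loop and the variable name fs are my own, and it relies on awk -v processing the \n escape into a real newline, which the awks I know of do.

```shell
# Run the same one-newline input through three FS settings:
# the default single space, a character that never occurs, and a newline.
results=$(
  for fs in ' ' 'b' '\n'; do
    printf '\n' |
      awk -v fs="$fs" 'BEGIN { FS = fs; RS = "a" } { print NR, NF, length($0) }'
  done
)

# One line per FS setting, in the order listed above.
echo "$results"
```

The output should match the three results shown above: 1 0 1, then 1 1 1, then 1 2 1. Only the default FS gives the newline its special whitespace treatment.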
Conclusion: I think the GNU awk documentation is wrong.