"sed" 中的范围运算符实际上是做什么的,它在 GNU/busybox 中被破坏了吗?
What does the range-operator in "sed" actually do, is it broken in GNU/busybox?
I am wondering whether the GNU and BusyBox implementations of "sed" are broken.
My default sed implementation is the one from GNU.
POSIX says:
An editing command with two addresses shall select the inclusive
range from the first pattern space that matches the first address
through the next pattern space that matches the second.
But then why does this print
$ { echo ha; echo ha; echo ha; } | sed '0,/ha/ !d'
ha
rather than the following?
ha
ha
Clearly the second "ha" here is the "next" pattern space that matches, so it should be printed as well!
But even stranger,
$ { echo ha; echo ha; echo ha; } | busybox sed '0,/ha/ !d'
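prints nothing at all!
But even if sed did what the POSIX definition says, it is still unclear what happens when a range expression is actually checked.
Does every range condition have its own internal state? Or is there one global state for all range conditions in a sed script?
Obviously, a range condition needs to remember at least whether it is currently in the "search for a match of the first address" state or in the "search for a match of the second address" state. Perhaps it even needs to remember a third state: "I have already processed the range and will not match again, no matter what".
It also matters, of course, when these conditions are updated: every time a new pattern space is read? Every time the pattern space is modified, for example by an s command? Or only when control flow reaches a range condition?
So, which is it?
Until I know more, I will avoid using range conditions in my sed scripts and treat them as a dubious feature.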
Two answers:
0 is not a valid POSIX address (lines are counted from 1), and 0,/re/ is a GNU extension.
The GNU sed man page includes:
0,addr2
Start out in "matched first address" state, until addr2 is
found. This is similar to 1,addr2, except that if addr2 matches
the very first line of input the 0,addr2 form will be at the end
of its range, whereas the 1,addr2 form will still be at the
beginning of its range. This works only when addr2 is a regular
expression.
Maybe this helps to clarify:
$ { echo ha1; echo ha2; echo ha3; } | sed '0,/ha/ !d'
ha1
$ { echo ha1; echo ha2; echo ha3; } | sed '1,/ha/ !d'
ha1
ha2
$ { echo ha1; echo ha2; echo ha3; } | sed --posix '0,/ha/ !d'
sed: -e expression #1, char 8: invalid usage of line address 0
The busybox code explicitly checks whether addr1 is greater than 0, so with 0 as the first address it never enters the matched state. See the busybox source code, line 1121:
|| (sed_cmd->beg_line > 0
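If the script has to run on both GNU sed and busybox, one way to sidestep the 0,/re/ extension is to quit at the first match instead. This is only a sketch, and it assumes you want nothing after the first matching line: unlike a range, q stops reading input altogether. The variable name first_match is just for illustration.

```shell
# Print everything up to and including the first line matching /ha/,
# then quit. Uses only the POSIX q command, no line address 0.
first_match=$(printf '%s\n' ha1 ha2 ha3 | sed '/ha/q')
printf '%s\n' "$first_match"
```

This prints only ha1, the same result as the GNU-only 0,/ha/ !d form.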
- Each range keeps its own state, because several ranges can be active at the same time.
POSIX says:
An editing command with two addresses shall select the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second. (If the second address is a number less than or equal to the line number first selected, only one line shall be selected.) Starting at the first line following the selected range, sed shall look again for the first address. Thereafter, the process shall be repeated.
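Two parts of that paragraph are easy to check by hand: the range re-arms after it has closed, and a numeric second address that is not beyond the starting line selects a single line. A small sketch, with made-up input and illustrative variable names:

```shell
# Input with two start/end blocks: after the first "end" closes the
# range, sed looks for /start/ again and selects the second block too.
restart_out=$(printf '%s\n' a start b end c start d end e | sed -n '/start/,/end/p')
printf '%s\n' "$restart_out"

# The second address is line number 2, which is not greater than the
# line where /three/ first matched (line 3), so only that line is selected.
single_out=$(printf '%s\n' one two three four | sed -n '/three/,2p')
printf '%s\n' "$single_out"
```

The first command prints both start..end blocks and skips a, c and e; the second prints only three.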
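The addresses are tested each time the command is encountered, against the pattern space as it is at that moment: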
$ { echo ..a; echo ..b; echo ..c; } |\
sed -n '
=;
y/cba/ba:/;
1 ,/b/ s/$/ 1/p;
/a/,/c/ s/$/ 2/p;
2, 3 s/$/ 3/p;
'
1
..: 1
2
..a 1
..a 1 2
..a 1 2 3
3
..b 1
..b 1 2
..b 1 2 3
For an example of the per-command state, see the sed_cmd_s typedef in the busybox source code.
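To see in isolation that a range address is matched against the pattern space as it stands when control reaches the command, and not against the line as it was read, compare the two runs below. This is a minimal sketch with made-up input; the variable names are only for illustration.

```shell
# Unmodified input: the range opens on "a" and closes on "b".
plain_out=$(printf '%s\n' a b c | sed -n -e '/a/,/b/p')
printf '%s\n' "$plain_out"

# Here s rewrites "a" to "X" before the range address is tested,
# so /a/ never matches and the range never opens.
edited_out=$(printf '%s\n' a b c | sed -n -e 's/a/X/' -e '/a/,/b/p')
printf '%s\n' "$edited_out"
```

The first run prints a and b; the second prints nothing, because the first address is never satisfied once the s command has changed the pattern space.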