Shell, IFS, read and tabulation
I'm trying to read a TSV file in a shell script, and I found that when IFS is set to a tab, read skips empty values. An example is worth 1000 words:
$ echo -e "a\tb\tc" | while IFS=$'\t' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a - b - c
This works as expected.
$ echo -e "a\t\tc" | while IFS=$'\t' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a - c -
I expected $v2 to be set to empty and $v3 to be set to "c".
$ echo -e "a||c" | while IFS=$'|' read v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
a - - c
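With | as the delimiter, $v2 gets an empty value and $v3 gets the value "c", as I expected.
Does anyone have an explanation for the different behavior when using | or \t? And is there a way to make \t behave like |?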
Anyone has an explanation about the different behavior when using | or \t ?
From POSIX read:
The line shall be split into fields as in the shell (see Field Splitting); the first field shall be assigned to the first variable var, the second field to the second variable var, and so on. If there are fewer var operands specified than there are fields, the leftover fields and their intervening separators shall be assigned to the last var. If there are fewer fields than vars, the remaining vars shall be set to empty strings.
So let's go to POSIX shell field splitting (emphasis mine):
The shell shall treat each character of the IFS as a delimiter and use the delimiters to split the results of parameter expansion and command substitution into fields.
- If the value of IFS is a <space>, <tab>, and <newline>, or if it is unset, ... [doesn't apply here]
- If the value of IFS is null, ... [also doesn't apply here]
- Otherwise, the following rules shall be applied in sequence. The term "IFS white space" is used to mean any sequence (zero or more instances) of white space characters that are in the IFS value (for example, if IFS contains <space>/<comma>/<tab>, any sequence of <space>s and <tab>s is considered IFS white space).
- IFS white space shall be ignored at the beginning and end of the input.
- Each occurrence in the input of an IFS character that is not IFS white space, along with any adjacent IFS white space, shall delimit a field, as described previously.
- Non-zero-length IFS white space shall delimit a field.
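When IFS is set to any combination of whitespace characters, a run of those characters in the input is merged into a single delimiter when splitting fields (that is the "non-zero-length" rule).
So echo -e "a\t\tc" | IFS=$'\t' read v1 v2 v3 is equivalent to echo -e "a\t\t\t\t\tc" | IFS=$'\t' read v1 v2 v3. Because there are "fewer fields than vars" (2 vs 3), v3 is set to the empty string.
But when IFS is set to anything other than whitespace, every occurrence of the IFS character delimits a field.
Yet another interesting corner case where whitespace characters are treated specially.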
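To make the two rules concrete, here is a minimal sketch that runs the same three-variable read with a whitespace delimiter and with a non-whitespace one. The helper name split_demo is hypothetical, and the tab is built with printf rather than $'\t' only so that the snippet also runs in a plain POSIX sh.

```shell
# split_demo is a hypothetical helper: it splits one line on a single
# delimiter character and prints the three resulting fields in brackets.
split_demo() {
    # $1 = delimiter character, $2 = input line
    printf '%s\n' "$2" | {
        IFS=$1 read -r v1 v2 v3
        printf '[%s] [%s] [%s]\n' "$v1" "$v2" "$v3"
    }
}

tab=$(printf '\t')

# Tab is IFS white space: the two adjacent tabs collapse into one delimiter,
# so only two fields are found and v3 is left empty.
ws_result=$(split_demo "$tab" "a${tab}${tab}c")

# | is not white space: each occurrence delimits a field, so v2 is empty.
pipe_result=$(split_demo '|' 'a||c')

printf '%s\n' "$ws_result" "$pipe_result"
```

The first line printed is [a] [c] [] and the second is [a] [] [c], matching the outputs shown in the question.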
And a way to have \t behave like for | ?
In bash, replace the tab with something unique before reading. I like to use the 0x01 byte:
echo -e "a\t\tc" |
tr '\t' $'\x01' |
while IFS=$'\x01' read -r v1 v2 v3; do echo "$v1 - $v2 - $v3"; done
Remember to use read -r.
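The same substitution also preserves an empty first field, which a tab IFS would strip as leading white space. A small sketch, assuming the data never contains a 0x01 byte itself; split_tsv_line is a hypothetical helper name, and printf is used instead of $'\x01' only so the snippet is not tied to bash.

```shell
# The 0x01 byte used as a stand-in delimiter (assumed absent from the data).
sep=$(printf '\001')
tab=$(printf '\t')

# split_tsv_line is a hypothetical helper: it turns every tab into 0x01,
# then splits on that byte, which is not IFS white space.
split_tsv_line() {
    printf '%s\n' "$1" | tr '\t' '\001' | {
        IFS=$sep read -r v1 v2 v3
        printf '%s - %s - %s\n' "$v1" "$v2" "$v3"
    }
}

middle=$(split_tsv_line "a${tab}${tab}c")     # empty second field
leading=$(split_tsv_line "${tab}b${tab}c")    # empty first field

printf '%s\n' "$middle" "$leading"
```

Both empty fields survive: the first line keeps an empty v2, and the second keeps an empty v1 instead of shifting b into it.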