Why doesn't awk remove the BOM from the middle of a line?

I am trying to use awk to remove all byte order marks from a file (I have a lot of them):

awk '{sub(/\xEF\xBB\xBF/,"")}{print}' f1.txt > f2.txt

It seems to remove every BOM at the start of a line, but the ones in the middle of a line are left in place. I can verify this with:

grep -U $'\xEF\xBB\xBF' f2.txt

grep returns one line, with a BOM in the middle.

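As already mentioned, sub() replaces only the leftmost matching substring, so if you want a global replacement, use gsub() or, better yet, gensub() (a GNU awk extension).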
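A minimal sketch of the fix, assuming a POSIX shell with awk, grep, printf, and mktemp available. The temporary directory, the sample line, and the `bom` variable are invented for illustration. The BOM bytes are passed in with `-v` instead of being written as `\xEF\xBB\xBF` because hex escapes in a regex are not part of POSIX awk; with gawk, the command from the question should work equally well once `sub` is changed to `gsub`.

```shell
# Work in a scratch directory so no existing f1.txt/f2.txt is overwritten.
dir=$(mktemp -d)

# The three UTF-8 BOM bytes (EF BB BF), written as octal escapes for portability.
bom=$(printf '\357\273\277')

# Sample input: one BOM at the start of the line and another in the middle.
printf '%sfoo%sbar\n' "$bom" "$bom" > "$dir/f1.txt"

# gsub() replaces every match on the line; sub() would stop after the first one.
LC_ALL=C awk -v bom="$bom" '{ gsub(bom, "") } { print }' "$dir/f1.txt" > "$dir/f2.txt"

# Count the lines that still contain a BOM (grep -c prints 0 when none match).
remaining=$(LC_ALL=C grep -c "$bom" "$dir/f2.txt" || true)
echo "lines still containing a BOM: $remaining"
```

With `sub` in place of `gsub`, the same check reports one remaining line, which matches the behaviour described in the question.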

sub(regexp, replacement [, target])

Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).

gsub(regexp, replacement [, target])

Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.

gensub(regexp, replacement, how [, target])

Search the target string target for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’ (short for “global”), then replace all matches of regexp with replacement. Otherwise, "how" is treated as a number indicating which match of regexp to replace. gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions.
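The difference between the three functions is easier to see on a visible string than on BOM bytes. The following is a small sketch, assuming a POSIX shell; the input `a-b-c` and the variable names are made up for illustration. Because gensub() is a gawk extension, that part runs only if a `gawk` binary is found. Note that gensub() returns the modified string and leaves its target unchanged, whereas sub() and gsub() modify the target in place and return a count.

```shell
# sub(): replaces only the leftmost match and returns 0 or 1.
sub_out=$(echo 'a-b-c' | awk '{ n = sub(/-/, "+"); print n, $0 }')
echo "$sub_out"     # prints: 1 a+b-c

# gsub(): replaces every match and returns the number of replacements.
gsub_out=$(echo 'a-b-c' | awk '{ n = gsub(/-/, "+"); print n, $0 }')
echo "$gsub_out"    # prints: 2 a+b+c

# gensub() (gawk only): here how = 2, so only the second match is replaced.
# The result is returned; $0 itself is not modified.
gensub_out=""
if command -v gawk >/dev/null 2>&1; then
    gensub_out=$(echo 'a-b-c' | gawk '{ print gensub(/-/, "+", 2), $0 }')
    echo "$gensub_out"    # prints: a-b+c a-b-c
fi
```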

Plenty of useful information and examples can be found at the link below:

The GNU Awk User's Guide: String Functions / 9.1.3 String-Manipulation Functions