Trying to use iconv to convert US-ASCII to UTF-16LE and getting undesired output

Question

I am trying to convert the file System.Web.WebPages.Razor.dll.refresh from ASCII to UTF-16LE. When I run file -i on the other refresh files in the directory, I get something like:

System.Web.Optimization.dll.refresh: text/plain; charset=utf-16le

When I run it on my target file, I get:

System.Web.WebPages.Razor.dll.refresh: text/plain; charset=us-ascii

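I think this encoding difference is causing an error in my build pipeline, so I am trying to convert this ASCII file to UTF-16LE so that it matches the other refresh files. However, iconv does not seem to give me the output I am looking for.

My command: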

iconv -f US-ASCII -t UTF-16LE "System.Web.WebPages.Razor.dll.refresh" > "System.Web.WebPages.Razor.dll.refresh.new" && mv -f "System.Web.WebPages.Razor.dll.refresh.new" "System.Web.WebPages.Razor.dll.refresh"

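There are two problems with the output.

1) It spaces out the text in the file (i.e. "this" becomes "t h i s").

2) When I run file -i on the new file, I get the following output: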

System.Web.WebPages.Razor.dll.refresh: application/octet-stream; charset=binary

Why am I getting this binary output, and why is the text spaced out? Is there a better way to convert this file to the correct encoding?

Answer 1

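file reports your new file as binary data because it relies on a leading Byte Order Mark (BOM) to tell whether the content is encoded in UTF-16. When you specify the endianness explicitly, iconv omits the BOM: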

$ iconv -f us-ascii -t utf16le <<<test | xxd
00000000: 7400 6500 7300 7400 0a00                 t.e.s.t...

But if you let it use the native byte order (which on typical modern hardware is LE 99% of the time):

$ iconv -f us-ascii -t utf16 <<<test | xxd
00000000: fffe 7400 6500 7300 7400 0a00            ..t.e.s.t...

The mark is there, and file -i will report foo.txt: text/plain; charset=utf-16le.
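The "spaced out" text from the question follows from the same dumps: in UTF-16LE every ASCII character takes two bytes, its own byte followed by a 00 byte, and many viewers render that NUL as a gap. A minimal sketch of this, assuming iconv, od, wc and tr are available (the variable names are only illustrative):

```shell
# Byte counts before and after conversion: pure-ASCII input doubles in size.
# tr strips the padding some wc implementations put around the number.
ascii_bytes=$(printf 'this' | wc -c | tr -d ' ')
utf16_bytes=$(printf 'this' | iconv -f US-ASCII -t UTF-16LE | wc -c | tr -d ' ')
echo "ASCII: $ascii_bytes bytes, UTF-16LE: $utf16_bytes bytes"

# Hex dump of the converted text: each letter's byte is followed by a 00 byte.
printf 'this' | iconv -f US-ASCII -t UTF-16LE | od -An -tx1
```

So the file content is not damaged; it is only being displayed by a tool that does not decode UTF-16.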

I don't know of a way to force iconv to always add a BOM when an explicit UTF-16 byte order is given. Instead, here is a perl one-liner that converts to explicit UTF-16LE and adds a BOM:

perl -0777 -pe 'BEGIN{binmode STDOUT,":encoding(utf16le)"; print "\x{FEFF}"}' in.txt > out.txt

Or use printf to print the LE-encoded BOM and iconv to print the rest:

(printf "\xFF\xFE"; iconv -f us-ascii -t utf-16le in.txt) > out.txt
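For the in-place replacement the question was attempting, the printf approach could be wrapped roughly as follows. This is only a sketch: the function name to_utf16le_bom and the demo file are hypothetical, and it assumes bash, mktemp and an iconv that accepts the US-ASCII and UTF-16LE names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative, not from the answer): convert
# an ASCII file to UTF-16LE with a leading BOM, replacing the original file.
to_utf16le_bom() {
    local src=$1 tmp
    tmp=$(mktemp "${src}.XXXXXX") || return 1
    # \377\376 are the bytes FF FE in octal; octal escapes are portable in
    # printf, whereas \xHH is not guaranteed by POSIX.
    if { printf '\377\376' && iconv -f US-ASCII -t UTF-16LE "$src"; } > "$tmp"; then
        mv -f "$tmp" "$src"
    else
        # Conversion failed: leave the original file untouched.
        rm -f "$tmp"
        return 1
    fi
}

# Usage sketch on a throwaway file: the dump should start with ff fe.
demo=$(mktemp)
printf 'test\n' > "$demo"
to_utf16le_bom "$demo"
od -An -tx1 "$demo"
```

Writing to a temporary file first means a failed conversion never clobbers the original. Note that mktemp creates the file with restrictive permissions, so the replaced file may need a chmod afterwards.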

bash

ascii

character-encoding

iconv

utf-16le