Why does GNU Diff not understand UTF-16 (only UTF-8)?

GNU Diff is used by default in Git.

Why has this bug not been fixed?

The BOM is part of the Unicode standard: http://www.unicode.org/faq/utf_bom.html#bom4

Why do most programmers ignore the BOM?

On Windows, some source files are encoded in UTF-16 by default.

This is explained in section 18.1.1 of the GNU diffutils documentation, "Handling Multibyte and Varying-Width Characters":

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed patches to support internationalized diff. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

It doesn't even handle UTF-8 entirely correctly, so it's no surprise that it can't handle UTF-16.
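To see why a byte-oriented tool struggles with UTF-16, it helps to look at the raw bytes. The sketch below uses Python purely as an illustration (it is not how diff works internally) and encodes the same line both ways. As far as I can tell from the diffutils manual, diff treats a file as binary when it finds a NUL byte near the start of the file, and mostly-ASCII UTF-16 text is full of NUL bytes. The helper name `has_nul` is made up for this example.

```python
import codecs

text = "hello\r\n"

# UTF-8: one byte per ASCII character, no NUL bytes.
utf8_bytes = text.encode("utf-8")

# Little-endian UTF-16 with an explicit BOM, like the files a Windows editor writes.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def has_nul(data: bytes) -> bool:
    """Return True if the data contains a NUL byte (hypothetical helper)."""
    return b"\x00" in data


print(utf8_bytes.hex(" "))   # 68 65 6c 6c 6f 0d 0a
print(utf16_bytes.hex(" "))  # ff fe 68 00 65 00 6c 00 6c 00 6f 00 0d 00 0a 00
print(has_nul(utf8_bytes), has_nul(utf16_bytes))  # False True
```

Note that the newline is the two bytes `0a 00` here, so a tool that splits lines on the single byte `0a` cuts a UTF-16 code unit in half.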

(You might be able to mitigate the problem by using a locale that recognizes UTF-16. I don't have such a locale on any system I use, including Cygwin under Windows 10.)

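One problem I ran into is that the BOM isn't recognized as text. You can partially work around this with the -a option, which forces diff to assume its input files are text. When I used it on two little-endian UTF-16 text files with a BOM and Windows-style line endings, I got: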

$ diff hello.txt hello2.txt
Binary files hello.txt and hello2.txt differ
$ diff -a hello.txt hello2.txt 
1c1
< ��hello
---
> ��Hello
$

The output is a mixture of UTF-8/ASCII, UTF-16, and garbage.
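If the goal is just a readable comparison, one workaround is to decode the UTF-16 input before diffing rather than handing raw bytes to a byte-oriented tool. The following is a minimal sketch in Python using the standard `difflib` module. It is not GNU diff and its output format is only similar. The function name `diff_utf16` is hypothetical, and the in-memory byte strings stand in for the two files from the transcript.

```python
import codecs
import difflib


def diff_utf16(old_bytes, new_bytes, old_name="old", new_name="new"):
    """Decode two UTF-16 byte strings and return unified-diff lines.

    The "utf-16" codec consumes a leading BOM and uses it to pick the
    byte order, so the BOM does not show up in the diff output.
    """
    old_lines = old_bytes.decode("utf-16").splitlines(keepends=True)
    new_lines = new_bytes.decode("utf-16").splitlines(keepends=True)
    return list(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )


# Stand-ins for hello.txt and hello2.txt: little-endian UTF-16, BOM, CRLF endings.
old = codecs.BOM_UTF16_LE + "hello\r\n".encode("utf-16-le")
new = codecs.BOM_UTF16_LE + "Hello\r\n".encode("utf-16-le")

result = diff_utf16(old, new, "hello.txt", "hello2.txt")
print("".join(result), end="")
```

For real files, the byte strings could come from `pathlib.Path("hello.txt").read_bytes()`. From a shell, the same idea would be to convert both files to UTF-8 with `iconv -f UTF-16 -t UTF-8` and diff the converted copies.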

(I suspect the underlying reason is that UTF-16 is fairly Windows-specific, and the maintainers of GNU diffutils don't care much about Windows.)

Most programmers ignore the BOM because UTF-8 doesn't need it.
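For a tool that does want to honor a BOM, detecting one takes little code. Here is a minimal sketch in Python using the BOM constants from the standard `codecs` module. The helper `sniff_bom` and the table `_BOMS` are names invented for this example, not part of any library.

```python
import codecs

# Order matters: the UTF-32 LE BOM (ff fe 00 00) begins with the
# UTF-16 LE BOM (ff fe), so the longer BOMs are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


encoding, skip = sniff_bom(b"\xff\xfeh\x00e\x00l\x00l\x00o\x00")
print(encoding, skip)
```

This is only a heuristic: a UTF-16 LE file whose first character after the BOM is U+0000 would be reported as UTF-32 LE, and a file without any BOM tells you nothing.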

https://lists.gnu.org/archive/html/bug-diffutils/2018-04/msg00009.html

UTF-8 does not require BOM, but for UTF-16 and UTF-32 BOM is always present. Files with UTF-16 and UTF-32 without the BOM should be identified as binary.

But why there are no plans to support UTF-16 and UTF-32? Diff is part of the Git and is used all over the world. Now 2018 and Unicode solved problems with encodings.

https://lists.gnu.org/archive/html/bug-diffutils/2018-04/msg00011.html

why there are no plans to support UTF-16 and UTF-32?

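Nobody has volunteered to do it, and there is no pressing need. UTF-16 and UTF-32 are mainly used for internal representation, not for text files. For more on this topic, see: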

http://utf8everywhere.org/