Why does GNU Diff not understand UTF-16 (only UTF-8)?
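This GNU Diff is used by default in Git.
Why has this bug not been fixed?
The BOM is part of the Unicode standard. http://www.unicode.org/faq/utf_bom.html#bom4
Why do most programmers ignore the BOM?
On Windows, some source files are encoded in UTF-16 by default.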
This is explained in section 18.1.1 of the GNU diffutils documentation, "Handling Multibyte and Varying-Width Characters":
diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.
Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.
These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.
The IBM GNU/Linux Technology Center Internationalization Team has proposed patches to support internationalized diff. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.
It does not handle even UTF-8 entirely correctly, so it is no surprise that it cannot handle UTF-16.
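To make the quoted limitation concrete, here is a minimal Python sketch of what byte-at-a-time processing does to multibyte text. It is an illustration only, not diff's actual code; the helper names strip_blanks_bytewise and strip_blanks_charwise are made up for this example.

```python
# Illustration only: this imitates byte-oriented processing in general and is
# not taken from the diffutils source.

ASCII_BLANKS = b" \t"

def strip_blanks_bytewise(line: bytes) -> bytes:
    """Drop ASCII space and tab bytes, examining one byte at a time."""
    return bytes(b for b in line if b not in ASCII_BLANKS)

def strip_blanks_charwise(line: str) -> str:
    """Drop every character that Unicode classifies as whitespace."""
    return "".join(ch for ch in line if not ch.isspace())

# U+3000 IDEOGRAPHIC SPACE is whitespace, but it occupies 3 bytes in UTF-8.
with_ascii_space = "foo bar"
with_wide_space = "foo\u3000bar"

same_bytewise = (strip_blanks_bytewise(with_ascii_space.encode("utf-8"))
                 == strip_blanks_bytewise(with_wide_space.encode("utf-8")))
same_charwise = (strip_blanks_charwise(with_ascii_space)
                 == strip_blanks_charwise(with_wide_space))

# One character is not always one byte, so counting bytes as columns
# misaligns side-by-side output.
word = "h\u00e9llo"  # "hello" with an acute accent on the e
char_count = len(word)
byte_count = len(word.encode("utf-8"))

print(same_bytewise, same_charwise, char_count, byte_count)
```

The byte-wise comparison still sees a difference after "ignoring spaces" because the three bytes of U+3000 are never recognized as a space, while the character-aware version treats the two lines as equal.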
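(You may be able to mitigate the problem by using a locale that recognizes UTF-16. I don't have such a locale on any system I use, including Cygwin under Windows 10.)
One problem I ran into is that the BOM is not recognized as text. You can partially work around this with the -a option, which forces diff to assume that its input files are text. When I use it on two small little-endian UTF-16 text files with a BOM and Windows-style line endings, I get: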
$ diff hello.txt hello2.txt
Binary files hello.txt and hello2.txt differ
$ diff -a hello.txt hello2.txt
1c1
< ��hello
---
> ��Hello
$
The output is a mixture of UTF-8/ASCII, UTF-16, and garbage.
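A common way around this is to decode the files to text before comparing them. The following is a rough Python sketch using the standard difflib module, not a replacement for diff. The function name diff_utf16 is hypothetical, and the sample bytes only stand in for hello.txt and hello2.txt, whose contents are assumed from the output above. It also shows why the files look binary: the diffutils manual describes the text/binary check as looking for null bytes near the start of the file, and UTF-16 encodes every ASCII character with one.

```python
import codecs
import difflib

def diff_utf16(old: bytes, new: bytes,
               old_name: str = "old", new_name: str = "new") -> str:
    """Decode two UTF-16 byte strings and return a unified diff as text."""
    # The "utf-16" codec reads the BOM to pick the byte order and strips it.
    old_lines = old.decode("utf-16").splitlines(keepends=True)
    new_lines = new.decode("utf-16").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, old_name, new_name))

# Sample data standing in for hello.txt and hello2.txt (assumed contents):
# little-endian UTF-16 with a BOM and Windows line endings.
hello = codecs.BOM_UTF16_LE + "hello\r\n".encode("utf-16-le")
hello2 = codecs.BOM_UTF16_LE + "Hello\r\n".encode("utf-16-le")

# Each ASCII character becomes two bytes in UTF-16, one of which is NUL.
has_nul = b"\x00" in hello

result = diff_utf16(hello, hello2, "hello.txt", "hello2.txt")
print(result)
```

Because the comparison happens on decoded strings, the result contains neither the BOM nor stray NUL bytes. The carriage returns from the Windows line endings are kept, since decoding bytes does not translate newlines.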
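(I suspect the underlying reason is that UTF-16 is fairly Windows-specific, and the GNU diffutils maintainers are not much concerned with Windows.)
Most programmers ignore the BOM because UTF-8 does not need it.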
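For tools that do want to honor the BOM, detection is a short prefix check. Below is a minimal Python sketch built on the BOM constants in the standard codecs module; the function name sniff_bom and its return shape are assumptions made for this example.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfeh\x00e\x00"))   # UTF-16 LE sample
print(sniff_bom(b"hello"))                # no BOM
```

One caveat: a UTF-16 LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by its first four bytes, so a check like this is a heuristic rather than proof of the encoding.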
https://lists.gnu.org/archive/html/bug-diffutils/2018-04/msg00009.html
UTF-8 does not require BOM, but for UTF-16 and UTF-32 BOM is always present. Files with UTF-16 and UTF-32 without the BOM should be identified as binary.
But why there are no plans to support UTF-16 and UTF-32? Diff is part of the Git and is used all over the world. Now 2018 and Unicode solved problems with encodings.
https://lists.gnu.org/archive/html/bug-diffutils/2018-04/msg00011.html
why there are no plans to support UTF-16 and UTF-32?
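Nobody has volunteered to do it, and there is no pressing need. UTF-16 and UTF-32 are mainly used for internal representations, not for text files. For more on this topic, see: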