Same visible character but different bytes

Question

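I have two files, each containing the same (Hindi) word, but I copied the word for each file from a different source. While the words from the two sources look alike, their bytes are different. The files are here and here. I am not sure of the original encoding in either case, but the characters display correctly when the files are opened as UTF-8.

Also interesting: when I run the uniq utility on them, only one entry is returned, but when I put them in one file and run :sort u in vim, I get two entries.

Please explain what is going on.

Update:

If you don't want to open the links, the Python literals are '\u091c\u0941\u095c\n' and '\u091c\u0941\u0921\u093c\n', and the word looks like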

Answer 1

Vim says:

:h :sort
The details about sorting depend on the library function used. There is no guarantee that sorting obeys the current locale. You will have to try it out.

Meanwhile uniq (I assume the GNU coreutils one, not a vim command) is Unicode-aware and knows how to collate text.
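To illustrate why the two tools can disagree, here is a minimal Python sketch. It does not reproduce how vim or uniq are implemented: plain string comparison stands in for a comparison that ignores Unicode equivalence, and NFC normalization stands in for a Unicode-aware comparison. The variable names are made up for the example.

```python
import unicodedata

# The two spellings of the word from the question (without the newline).
lines = ['\u091c\u0941\u095c', '\u091c\u0941\u0921\u093c']

# Plain comparison by code points: the strings differ, so both survive.
unique_raw = sorted(set(lines))

# Comparison after NFC normalization: canonically equivalent
# strings become identical, so only one entry is left.
unique_normalized = sorted({unicodedata.normalize('NFC', s) for s in lines})

print(len(unique_raw), len(unique_normalized))
```

The first count corresponds to what :sort u reported, the second to what uniq reported.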

Press ga or g8 on a character in vim to see, respectively, the code points or the bytes that make up that character.
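If you would rather inspect the strings outside vim, roughly the same information can be printed from Python. This is only a sketch; describe is a helper name invented here, and the two literals are the ones given in the question.

```python
def describe(text):
    """Return (code point, UTF-8 bytes as hex) pairs, one per code point."""
    return [(f'U+{ord(ch):04X}', ch.encode('utf-8').hex()) for ch in text]

first = '\u091c\u0941\u095c'         # three code points
second = '\u091c\u0941\u0921\u093c'  # four code points

for label, word in (('first', first), ('second', second)):
    print(label)
    for code_point, utf8_hex in describe(word):
        print('  ', code_point, utf8_hex)
```

The first two code points are identical in both words; the difference is in the last letter, which is one code point in the first word and two in the second.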

Answer 2

095C is DEVANAGARI LETTER DDDHA: ड़
0921 is DEVANAGARI LETTER DDA: ड
093C is DEVANAGARI SIGN NUKTA (a dot below the character): ़

You can see in Python that they are canonically equivalent (Python 3 syntax here):

import unicodedata
unicodedata.normalize('NFC', '\u0921\u093c') == unicodedata.normalize('NFC', '\u095c')
# => True
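To dig a little further, the unicodedata module can also report the character names and the canonical decomposition. The sketch below assumes nothing beyond the standard library. One detail worth knowing: as far as I can tell, U+095C is excluded from composition, so both NFC and NFD turn it into the two-code-point sequence rather than the other way round.

```python
import unicodedata

composed = '\u095c'          # single code point
decomposed = '\u0921\u093c'  # base letter followed by nukta

# Print the name and canonical decomposition of each code point involved.
for ch in composed + decomposed:
    print(f'U+{ord(ch):04X}',
          unicodedata.name(ch),
          '| decomposition:',
          unicodedata.decomposition(ch) or '(none)')

# Both normalization forms map the single code point to the pair.
nfc_of_composed = unicodedata.normalize('NFC', composed)
nfd_of_composed = unicodedata.normalize('NFD', composed)
print([f'U+{ord(c):04X}' for c in nfc_of_composed])
print([f'U+{ord(c):04X}' for c in nfd_of_composed])
```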

You should be able to normalize your files using :%!uconv -x any-nfc (with ICU installed) or :%!ruby -ne 'puts $_.unicode_normalize(:nfc)' (with Ruby installed).
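If neither tool is at hand, the same normalization could be done with a few lines of Python. This is a sketch, not a tested utility: normalize_file is a name made up for the example, and it assumes the file is UTF-8 and small enough to read into memory.

```python
import unicodedata
from pathlib import Path

def normalize_file(path, form='NFC', encoding='utf-8'):
    """Rewrite the file in the given normalization form.

    Returns True if the file content changed, False otherwise.
    """
    target = Path(path)
    text = target.read_text(encoding=encoding)
    normalized = unicodedata.normalize(form, text)
    if normalized == text:
        return False  # already normalized, leave the file untouched
    target.write_text(normalized, encoding=encoding)
    return True
```

Calling normalize_file on both files from the question should leave them byte-for-byte identical, after which :sort u would report a single entry.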

vim

text

utf-8

character-encoding

uniq