localeCompare returns 0 for different Unicode symbols

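I want to use localeCompare to sort strings strictly, but I found that it returns 0 when given two different Unicode characters, wrongly indicating that they are the same. For example: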

ℜ U+211C (alt-08476) BLACK-LETTER CAPITAL R = real part

ℝ U+211D (alt-08477) DOUBLE-STRUCK CAPITAL R = the set of real numbers

"ℜ".localeCompare("ℝ", "en")   
> 0

"ℜ" === "ℝ"                    
> false

"ℜ".charCodeAt(0)
> 8476

"ℝ".charCodeAt(0)
> 8477

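I looked at the documentation, but the defaults are already "sort" for usage and "variant" for sensitivity, which appear to be the strictest settings: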

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/Collator
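
As a quick check of those defaults, here is a minimal sketch that reads back the options a default "en" collator resolves to and then passes them explicitly. The variable names are illustrative only, and the printed values depend on the ICU data shipped with the runtime.

```javascript
// Inspect the options that a default collator resolves to for the "en" locale.
const resolved = new Intl.Collator('en').resolvedOptions();
console.log(resolved.usage, resolved.sensitivity);

// Compare U+211C and U+211D with the defaults and with the options spelled out.
const withDefaults = '\u211C'.localeCompare('\u211D', 'en');
const withExplicit = '\u211C'.localeCompare('\u211D', 'en', {
  usage: 'sort',
  sensitivity: 'variant',
});
console.log(withDefaults, withExplicit);
```

If both calls print the same number, spelling out the options changes nothing, which matches the observation above.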

Is localeCompare unable to sort strictly?

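It appears that String.prototype.localeCompare() recognizes both characters as non-ASCII variants of an uppercase R and, correctly, assigns no particular order between them.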

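console.log(
  // two non-ASCII (non-0x43) uppercase Cs; the second one is assumed to be
  // U+1D402 (MATHEMATICAL BOLD CAPITAL C)
  'ℂ'.localeCompare('\u{1D402}', 'en'),

  // two non-ASCII (non-0x5A) uppercase Zs; the second one is assumed to be
  // U+1D419 (MATHEMATICAL BOLD CAPITAL Z)
  'ℤ'.localeCompare('\u{1D419}', 'en'),

  // 0x5A ASCII Z precedes both:
  'Z'.localeCompare('ℤ', 'en'),
  'Z'.localeCompare('\u{1D419}', 'en'),
);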

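You can fall back on Unicode code unit order wherever localeCompare treats the strings as equivalent and therefore defines no sort order: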

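// Locale order first; fall back to code unit order when localeCompare returns 0.
const sort = (a, b) => a.localeCompare(b) || (a < b ? -1 : a > b ? 1 : 0);

// '\u{1D402}' (MATHEMATICAL BOLD CAPITAL C) is assumed here; its first code unit is 0xD835.
console.log(
  //  1 (C precedes U+1D402 in localeCompare)
  sort('\u{1D402}', 'C'),
  // -1 (localeCompare returns 0; falls back to 0x2102 < 0xD835)
  sort('ℂ', '\u{1D402}')
);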
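
To show the tie-breaker applied to a whole array, here is a minimal sketch. The name `strictCompare` is hypothetical, and it reuses one `Intl.Collator` instance instead of calling localeCompare per comparison, which is a design choice of this sketch. The exact locale order depends on the runtime's ICU data, so no output is claimed.

```javascript
// Hypothetical helper: locale order first, code unit order as the tie-breaker.
const collator = new Intl.Collator('en');
const strictCompare = (a, b) =>
  collator.compare(a, b) || (a < b ? -1 : a > b ? 1 : 0);

// U+211D (double-struck R), ASCII R, U+211C (black-letter R)
const symbols = ['\u211D', 'R', '\u211C'];

// Sort copies so the input array stays untouched.
const sorted = [...symbols].sort(strictCompare);
const sortedFromReversed = [...symbols].reverse().sort(strictCompare);

console.log(sorted, sortedFromReversed);
```

Because the comparator only returns 0 for identical strings, the two results should come out in the same order regardless of how the input was arranged.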

From the ECMAScript spec:

The actual return values are implementation-defined to permit implementers to encode additional information in the value, but the function is required to define a total ordering on all Strings and to return 0 when comparing Strings that are considered canonically equivalent by the Unicode standard.

From the Wikipedia article on Unicode equivalence:

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.

For example, the code point U+006E (the Latin lowercase n) followed by U+0303 (the combining tilde ◌̃) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter ñ of the Spanish alphabet).

Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
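
The two notions quoted above can be observed with `String.prototype.normalize()`. The sketch below is illustrative only. It checks the ñ and Hangul examples from the quote under canonical normalization (NFC), and then shows that ℜ (U+211C) and ℝ (U+211D) are left unchanged by canonical decomposition (NFD) but both map to a plain R under compatibility decomposition (NFKD). In the Unicode data they carry compatibility (`<font>`) decompositions rather than canonical ones.

```javascript
// Canonical equivalence: NFC composes n + U+0303 into the single code point U+00F1.
const decomposed = 'n\u0303';
const precomposed = '\u00F1';
console.log(decomposed === precomposed);                  // false
console.log(decomposed.normalize('NFC') === precomposed); // true

// Hangul: three conjoining jamo compose into one syllable block (U+D55C).
const jamo = '\u1112\u1161\u11AB';
console.log(jamo.normalize('NFC') === '\uD55C');          // true

// Compatibility equivalence: NFD leaves U+211C and U+211D alone,
// while NFKD maps both of them to a plain 'R'.
const blackLetterR = '\u211C';
const doubleStruckR = '\u211D';
console.log(blackLetterR.normalize('NFD') === blackLetterR); // true
console.log(blackLetterR.normalize('NFKD'), doubleStruckR.normalize('NFKD')); // R R
```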

See also: https://unicode.org/reports/tr10/