In which cases does the normalize('NFKC') method work?

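I tried the normalize('NFKC') method on various characters, but it did not seem to do anything. The same cannot be said of NFC: whenever possible, normalize('NFC') replaces multiple code points with a single code point. For example: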

let t1 = `\u00F4`; //ô
let t2 = `\u006F\u0302`; //ô
console.log(t2.normalize('NFC') == t1); //true

And here is an example where NFKC does not do what I expected:

let s1 = '\uFB00'; //"ﬀ" (a single ligature code point)
let s2 = '\u0066\u0066'; //"ff" (two separate code points)
console.log(s2.normalize('NFKC') == s1); //false

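I used to think that NFKC replaces multiple code points with a single code point that represents the compatibility character. In other words, I expected NFKC to replace \u0066\u0066 with \uFB00.

If that is not how NFKC works, then how does it work?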

The point is that NFKC (like NFKD) normalizes for both compatibility equivalence and canonical equivalence.

The Unicode Standard says:

The type of full decomposition chosen depends on which Unicode Normalization Form is involved. For NFC or NFD, one does a full canonical decomposition, which makes use of only canonical Decomposition_Mapping values. For NFKC or NFKD, one does a full compatibility decomposition, which makes use of canonical and compatibility Decomposition_Mapping values.
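The difference between the two kinds of decomposition can be observed directly. The following is only a minimal sketch: `codePoints` is a hypothetical helper written for this illustration, and the two sample characters are the ones already used in the question.

```javascript
// Hypothetical helper (not a built-in): lists the code points of a string.
const codePoints = (s) =>
  [...s]
    .map((c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');

const ligatureChar = '\uFB00';    // "ﬀ": has only a compatibility decomposition
const precomposedChar = '\u00F4'; // "ô": has a canonical decomposition

// NFD applies canonical mappings only, so the ligature is left alone.
const ligatureNFD = codePoints(ligatureChar.normalize('NFD'));
// NFKD also applies compatibility mappings, so the ligature is split.
const ligatureNFKD = codePoints(ligatureChar.normalize('NFKD'));
// A canonical decomposition is applied by both forms.
const precomposedNFD = codePoints(precomposedChar.normalize('NFD'));
const precomposedNFKD = codePoints(precomposedChar.normalize('NFKD'));

console.log(ligatureNFD);     // U+FB00
console.log(ligatureNFKD);    // U+0066 U+0066
console.log(precomposedNFD);  // U+006F U+0302
console.log(precomposedNFKD); // U+006F U+0302
```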

That NFKC covers both kinds of mapping is understandable because, as MDN says:

All canonically equivalent sequences are also compatible, but not vice versa.
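One way to see this asymmetry is to compare the two pairs from the question under both forms. This is a sketch rather than a definitive test of equivalence; `equalUnder` is a hypothetical helper introduced only for this example.

```javascript
// Hypothetical helper: do two strings become identical under a given form?
const equalUnder = (form, a, b) => a.normalize(form) === b.normalize(form);

const canonicalPair = ['\u00F4', '\u006F\u0302']; // "ô" vs "o" + combining circumflex
const compatPair = ['\uFB00', '\u0066\u0066'];    // "ﬀ" ligature vs "f" + "f"

const equivalence = {
  // Canonically equivalent sequences match under both forms.
  canonicalUnderNFC: equalUnder('NFC', ...canonicalPair),
  canonicalUnderNFKC: equalUnder('NFKC', ...canonicalPair),
  // Compatibility-equivalent sequences match only under the K forms.
  compatUnderNFC: equalUnder('NFC', ...compatPair),
  compatUnderNFKC: equalUnder('NFKC', ...compatPair),
};

console.log(equivalence);
// { canonicalUnderNFC: true, canonicalUnderNFKC: true,
//   compatUnderNFC: false, compatUnderNFKC: true }
```

Note that both sides are normalized before comparing. Comparing `s2.normalize('NFKC')` against the raw ligature, as in the question, can never succeed, because NFKC never produces the ligature.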

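But it is also worth noting that NFKC treats compatibility equivalence and canonical equivalence differently. For canonical equivalence, NFKC normalizes the same way NFC does. For example: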

//"ô" (U+00F4) -> "o" (U+006F) + " ̂" (U+0302) -> "ô" (U+00F4)
let c1 = `\u006F\u0302`; //ô
console.log(c1.normalize('NFKC').length); //1

But compatibility normalization under this form works differently. The spec says:

Normalization Form KC does not attempt to map character sequences to compatibility composites. For example, a compatibility composition of “office” does not produce “o\uFB03ce”, even though “\uFB03” is a character that is the compatibility equivalent of the sequence of three characters “ffi”. In other words, the composition phase of NFC and NFKC are the same—only their decomposition phase differs, with NFKC applying compatibility decompositions.

For example:

//"ﬀ" (U+FB00) -> "f" (U+0066) + "f" (U+0066) -> "f" (U+0066) + "f" (U+0066)
let c2 = '\u0066\u0066'; //ff
console.log(c2.normalize('NFKC').length); //2
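
The spec's "office" example can be reproduced in the same way. A small sketch, with variable names chosen only for illustration: NFKC expands the "ﬃ" ligature (U+FB03) into three letters, and it leaves the plain spelling untouched instead of composing it into the ligature.

```javascript
const officeWithLigature = 'o\uFB03ce'; // "o" + "ﬃ" ligature + "ce", 4 code units
const officePlain = 'office';           // 6 code units

// The decomposition phase expands the compatibility character.
const expanded = officeWithLigature.normalize('NFKC');
// The composition phase is canonical only, so nothing is recombined.
const untouched = officePlain.normalize('NFKC');

console.log(expanded === officePlain, expanded.length); // true 6
console.log(untouched === officePlain);                 // true
console.log(untouched.includes('\uFB03'));              // false
```

So NFKC "works" in one direction only: it is useful for folding compatibility characters such as ligatures into their plain equivalents, not for producing them.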