Can ordering of toLowerCase and normalize matter?

Question

I want to ignore case differences and composition differences between strings, so I have

function normalize(text) {
    return text.normalize("NFD").toLowerCase();
}

It's in JavaScript, but in principle that shouldn't matter; the question is about Unicode.

Given

function normalize1(text) {
    return text.toLowerCase().normalize("NFD");
}

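Are there strings text1 and text2 for which normalize returns the same result for both but normalize1 does not, or vice versa? If the answer is "yes", is one of these normalizations "more correct" in some sense?

The scenario is that my program maintains a list of phrases and needs to determine whether a given web page contains any of them. False negatives are preferable to false positives, because phrases can easily be added (which is why I'm not using NFKD decomposition).

Bonus question: do normalize(text) and normalize1(text) ever differ in the first place? If not, then the answer to the first question is clearly "no".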
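For illustration, the scenario could be sketched roughly as follows. The helper name findPhrases and the sample strings are hypothetical; only the normalize step comes from the question.

```javascript
// The normalization step from the question.
function normalize(text) {
    return text.normalize("NFD").toLowerCase();
}

// Hypothetical helper: returns the phrases (in their original spelling)
// whose normalized form occurs in the normalized page text.
function findPhrases(pageText, phrases) {
    const haystack = normalize(pageText);
    return phrases.filter((phrase) => haystack.includes(normalize(phrase)));
}

// The page spells "É" as E + U+0301; the phrase uses precomposed U+00E9.
const page = "Fresh CAFE\u0301 au lait served daily";
const found = findPhrases(page, ["caf\u00E9 au lait", "espresso"]);
console.log(found);
```

Here only the first phrase is found, even though the page and the phrase differ in both case and composition.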

Answer 1

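You should probably use one of the default caseless matching algorithms, which use case folding rather than case mapping. See, for example, the following quote from the Unicode standard, which defines canonical caseless matching and partially answers your question (emphasis mine):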

In principle, normalization needs to be done after case folding, because case folding does not preserve the normalized form of strings in all instances. This requirement for normalization is covered in the following definition for canonical caseless matching:

D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

The invocations of canonical decomposition (NFD normalization) before case folding in D145 are to catch very infrequent edge cases. Normalization is not required before case folding, except for the character U+0345 ͅ combining greek ypogegrammeni and any characters that have it as part of their canonical decomposition, such as U+1FC3 ῃ greek small letter eta with ypogegrammeni. In practice, optimized versions of canonical caseless matching can catch these special cases, thereby avoiding an extra normalization step for each comparison.

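But if you're working with JavaScript, you're probably stuck with case mappings. As mentioned above, you should always normalize after case conversion, but I'm not sure whether the pre-normalization step for the edge cases is needed when lowercasing. If you want to be on the safe side, you could even consider: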

function normalize(text) {
    return text.normalize("NFD").toLowerCase().normalize("NFD");
}
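As a minimal sketch, a comparison in the spirit of D145 could be built on that function. Note the assumption: JavaScript has no built-in toCasefold, so toLowerCase() stands in for it here, and the helper names canonicalKey and canonicalCaselessMatch are made up for this example.

```javascript
// Assumption: toLowerCase() is used in place of toCasefold, so this only
// approximates canonical caseless matching as defined in D145.
function canonicalKey(text) {
    return text.normalize("NFD").toLowerCase().normalize("NFD");
}

// Hypothetical helper: true if x and y agree up to case and canonical composition.
function canonicalCaselessMatch(x, y) {
    return canonicalKey(x) === canonicalKey(y);
}

// Precomposed É vs. lowercase e + U+0301, with different case elsewhere.
console.log(canonicalCaselessMatch("\u00C9clair", "e\u0301CLAIR")); // true
// U+212B ANGSTROM SIGN vs. U+00E5 (å).
console.log(canonicalCaselessMatch("\u212B", "\u00E5")); // true
// Lowercasing leaves ß alone, whereas full case folding would give "ss".
console.log(canonicalCaselessMatch("stra\u00DFe", "STRASSE")); // false
```

The last line shows where lowercasing is stricter than case folding. For the use case in the question that errs toward false negatives, which is the preferred direction.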

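That said, I couldn't come up with an example where the order of NFD normalization and lowercasing matters (although it does matter with NFC and with other case conversions). So in practice you're probably fine with either of the two functions in your question.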
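To make this concrete, here is a rough sketch that compares the two orderings from the question for an arbitrary case mapping. The helper names orderMatters and scanCodePoints are hypothetical, and the uppercase example assumes the engine applies the full Unicode case mappings (including SpecialCasing), as ECMAScript specifies for toUpperCase.

```javascript
const lower = (s) => s.toLowerCase();
const upper = (s) => s.toUpperCase();

// True if "NFD, then case-map" and "case-map, then NFD" give different strings.
function orderMatters(caseMap, text) {
    const nfdFirst = caseMap(text.normalize("NFD"));
    const caseFirst = caseMap(text).normalize("NFD");
    return nfdFirst !== caseFirst;
}

// U+1FB3 (alpha with ypogegrammeni) followed by U+0301 (combining acute).
// NFD reorders this to alpha, U+0301, U+0345 before uppercasing, while
// uppercasing first turns U+1FB3 into U+0391 U+0399 ahead of the accent.
const sample = "\u1FB3\u0301";
console.log(orderMatters(upper, sample)); // true
console.log(orderMatters(lower, sample)); // false

// Brute-force check over all single code points (lone surrogates skipped).
// It only covers one-code-point strings, so an empty result is not a proof.
function scanCodePoints(caseMap) {
    const hits = [];
    for (let cp = 0; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;
        if (orderMatters(caseMap, String.fromCodePoint(cp))) hits.push(cp);
    }
    return hits;
}

console.log(scanCodePoints(lower).map((cp) => cp.toString(16)));
```

What the scan prints depends on the Unicode version built into the engine, so treat it as a way to search for counterexamples rather than as a guarantee.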


unicode

unicode-normalization