Should I use Python casefold?

Question

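I have recently been reading up on casing and on comparing strings while ignoring case. I read that the MSDN standard is to use InvariantCulture and to definitely avoid toLowercase. However, from what I have read, casefold looks like a more aggressive toLowercase. My question is: should I use casefold in Python, or is there a more pythonic standard to use instead? Also, does casefold pass the Turkey Test?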

Answer 1

1) In Python 3, casefold() should be used to implement caseless string matching.

Starting with Python 3.0, strings are stored as Unicode. The Unicode Standard Chapter 3.13 defines default caseless matching as follows:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, case folding alone is not enough to cover some corner cases or to pass the Turkey Test (see point 3).
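As a small illustration of how casefold() differs from lower() for caseless matching, here is a sketch using the German sharp s. The sample words are my own choice and are only meant to show the difference in behavior.

```python
# Illustrative strings chosen for this sketch.
upper = "STRASSE"
mixed = "Stra\u00DFe"  # "Straße", with the sharp s written as U+00DF

# lower() leaves the sharp s (U+00DF) unchanged, so the two strings differ.
lower_match = upper.lower() == mixed.lower()

# casefold() applies full case folding, which maps U+00DF to "ss".
casefold_match = upper.casefold() == mixed.casefold()

print(lower_match, casefold_match)
```

With lower() the comparison fails, while with casefold() the two spellings compare equal.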

2) As of Python 3.6, casefold() does not pass the Turkey Test.

For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.

The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

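Python's casefold() can only apply the default mapping, so it fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.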
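To make the mismatch visible, the following sketch prints the code points that casefold() produces for the two words. The strings are written with escape sequences so the exact code points are unambiguous; the variable names are only for illustration.

```python
folded_upper = "L\u0130MANI".casefold()   # "LİMANI"
folded_lower = "liman\u0131".casefold()   # "limanı"

# Default folding turns U+0130 into "i" + U+0307 and U+0049 into "i".
# Dotless ı (U+0131) has no default folding and is left unchanged.
print([f"U+{ord(c):04X}" for c in folded_upper])
print([f"U+{ord(c):04X}" for c in folded_lower])
print(folded_upper == folded_lower)
```

The first word gains a combining dot (U+0307) and ends in a dotted i, while the second keeps its dotless ı, so the folded strings differ.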

3) How to do caseless string matching in Python 3.

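The Unicode Standard Chapter 3.13 describes several caseless matching algorithms. Canonical caseless matching is probably suitable for most use cases. That algorithm already takes all corner cases into account. We only need to add an option to switch between non-Turkic and Turkic casefolding.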
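To see why the normalization steps in canonical caseless matching matter, consider a word written once with a precomposed accented letter and once with a combining accent. This is only a sketch: the helper name canonical_key and the sample word are assumptions made for the illustration.

```python
import unicodedata

composed = "\u00C9cole"      # "École" with precomposed É (U+00C9)
decomposed = "e\u0301cole"   # "école" as e + combining acute accent (U+0301)

def canonical_key(text):
    # Hypothetical helper: computes NFD(toCasefold(NFD(X))).
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

# casefold() alone does not unify the two canonically equivalent spellings.
plain_match = composed.casefold() == decomposed.casefold()

# Normalizing before and after folding makes them compare equal.
canonical_match = canonical_key(composed) == canonical_key(decomposed)

print(plain_match, canonical_match)
```

Plain casefolding treats the two spellings as different, whereas the normalized comparison treats them as a match.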

import unicodedata

def normalize_NFD(string):
    # Canonical decomposition (NFD).
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Recompose first so that I + U+0307 becomes İ (U+0130)
        # before the Turkic mappings are applied.
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')  # I → ı
        string = string.replace('\u0130', '\u0069')  # İ → i
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(X))), as used by canonical caseless matching.
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.

caseless_match() performs canonical caseless matching on string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.

Examples:

>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True

>>> caseless_match('LİMANI', 'limanı')
False

>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False

>>> caseless_match('INTENSIVE', 'intensive')
True


python

python-3.x

case-folding