Why is this Turkish character being corrupted when I lowercase it?

Question

I am trying to convert some words that contain Turkish characters to lowercase.

The words are read from a UTF-8 encoded file:

with open(filepath, 'r', encoding='utf8') as f:
    text = f.read().lower()

When I convert to lowercase, the Turkish character İ gets corrupted. When I convert to uppercase, however, it works fine.

Here is some sample code:

# Named "word" rather than "str" to avoid shadowing the built-in type
word = 'İşbirliği'
print(word)
print(word.lower())

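This is what the corrupted output looks like:

What is going on here?

Some information that may be useful:

I am using the Windows 10 cmd prompt
Python version 3.6.0
chcp is set to 65001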

Answer 1

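It is not corrupted.

Turkish has both a dotted lowercase i and a dotless lowercase ı, and likewise a dotted uppercase İ and a dotless uppercase I.

This poses a challenge when converting the dotted uppercase İ to lowercase: how do you preserve the information that, if the letter is converted back to uppercase, it should become the dotted İ again?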

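Unicode solves the problem like this: when İ is converted to lowercase, it is actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you are seeing is your terminal failing to render the combining character properly (or rather, not rendering it at all), and it has nothing to do with Python.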
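To make the mapping concrete, here is a minimal sketch that prints Python's default (locale-independent) case conversions for all four letters. It uses ascii() so the result does not depend on how the terminal renders combining characters; the loop itself is purely illustrative.

```python
# The four "i" letters used in Turkish: I, İ (U+0130), i, ı (U+0131).
for ch in ("I", "\u0130", "i", "\u0131"):
    # ascii() escapes non-ASCII code points, so combining marks stay visible.
    print(ascii(ch), "lower:", ascii(ch.lower()), "upper:", ascii(ch.upper()))

# 'I' lower: 'i' upper: 'I'
# '\u0130' lower: 'i\u0307' upper: '\u0130'
# 'i' lower: 'i' upper: 'I'
# '\u0131' lower: '\u0131' upper: 'I'
```

Only İ produces two code points when lowercased; the other three map one-to-one.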

You can see this happening with unicodedata.name():

>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

... although in a properly functioning, correctly configured terminal it renders without any problem:

>>> 'İ'.lower()
'i̇'
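If the terminal cannot be trusted to draw the combining mark, the string can still be inspected without relying on rendering. The following is a small sketch that checks the length, the code points, and the UTF-8 bytes of the lowercased result; the variable name lowered is just for illustration.

```python
lowered = "\u0130".lower()  # İ lowercased

print(len(lowered))                     # 2 code points, not 1
print([hex(ord(c)) for c in lowered])   # ['0x69', '0x307']
print(lowered.encode("utf-8"))          # b'i\xcc\x87'
```

The data is intact: an ASCII i followed by the two-byte UTF-8 encoding of U+0307.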

As a side note, if you do convert it back to uppercase, it stays in the decomposed form:

>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']

... although you can recombine it with unicodedata.normalize():

>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
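If the goal is the plain Turkish lowercase form (İ to i, I to ı) rather than Unicode's default mapping, one possible workaround is to map the two capitals explicitly before calling lower(). The helper below, turkish_lower, is a hypothetical name and not part of the standard library; it is a sketch that assumes the input is known to be Turkish text, since it would be wrong for other languages.

```python
import unicodedata


def turkish_lower(text):
    """Lowercase text using Turkish rules for I and İ (hypothetical helper)."""
    # Compose first, so that I + U+0307 becomes the single code point U+0130.
    text = unicodedata.normalize("NFC", text)
    # Handle the two Turkish capitals by hand, then lowercase everything else.
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(ascii(turkish_lower("\u0130\u015fbirli\u011fi")))  # 'i\u015fbirli\u011fi'
print(ascii(turkish_lower("ISPARTA")))                   # '\u0131sparta'
```

Because the result contains no combining character, it also displays correctly in terminals that do not render U+0307.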

