Is there a Python module that provides a test for equivalent Unicode characters?

Question

There are several Unicode characters that look alike, for example:

":" and "꞉"     U+A789
"?" and "？"    U+FF1F
"*" and "⁎"     U+204E
"'", "`", "‘", "’", and "ʻ"

There are also characters with and without diacritics, such as:

"c" and "ç"     U+00E7
"E" and "É"     U+00C9
"I" and "İ"     U+0130
"i" and "ı"     U+0131

I want to compare text from different sources so that words that are really the same compare equal, for example:

"naive" and "naïve"
"facade" and "façade"
"Hawai'i" and "Hawaiʻi"
"don't" and "don’t"
"letter 'A'" and "letter ‘A’" and "letter `A'"
"letter "B"" and "letter “B”"

Is there a module that provides an equivalence test between characters like these?

import unicode  # hypothetical module, shown only to illustrate the API I am looking for
if x != y and not unicode.same_character(x, y):
    ...

Answer 1

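If you transliterate the characters, you will get close to that.

In some of the cases you mention (e.g. "i" and "ı") the characters are not actually the same, but transliteration may still give you what you want.

Transliteration is language-specific.

Try this module: https://pypi.org/project/transliterate/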
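
If a full transliteration package is more than you need, a rough alternative is to fold the lookalike punctuation yourself. The sketch below uses only the standard library (str.maketrans and str.translate). The LOOKALIKES table and the fold_lookalikes name are my own assumptions for illustration: the table covers only the characters from the question, is far from exhaustive, and is not taken from the transliterate package.

```python
# Hand-written lookalike table (an assumption for illustration only;
# it is not exhaustive and does not come from any library).
LOOKALIKES = str.maketrans({
    "\uA789": ":",   # MODIFIER LETTER COLON
    "\uFF1F": "?",   # FULLWIDTH QUESTION MARK
    "\u204E": "*",   # LOW ASTERISK
    "`": "'",        # GRAVE ACCENT
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u02BB": "'",   # MODIFIER LETTER TURNED COMMA
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def fold_lookalikes(text: str) -> str:
    """Replace each character listed in LOOKALIKES with its ASCII stand-in."""
    return text.translate(LOOKALIKES)


# Compare two spellings after folding both sides.
print(fold_lookalikes("Hawai\u02BBi") == fold_lookalikes("Hawai'i"))
print(fold_lookalikes("don\u2019t") == fold_lookalikes("don't"))
```

Characters that are not in the table pass through unchanged, so the table has to be extended for every lookalike you care about.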

Answer 2

I suggest reading about the normalization forms: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
This question will also help you with diacritics: What is the best way to remove accents (normalize) in a Python unicode string?
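
To make the normalization forms a bit more concrete, here is a minimal sketch that uses only the standard unicodedata module. The strip_accents name is a hypothetical helper of mine, and dropping every combining mark after NFKD is one common approach to accent removal, not the only one.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose text with NFKD and drop the combining marks left behind."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Canonical equivalence: "e" + COMBINING ACUTE ACCENT composes to "é".
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9")
# Compatibility equivalence: FULLWIDTH QUESTION MARK folds to "?".
print(unicodedata.normalize("NFKC", "\uFF1F") == "?")
# Accent-insensitive comparison of words from the question.
print(strip_accents("na\u00EFve") == "naive")
print(strip_accents("fa\u00E7ade") == "facade")
```

Note that normalization only helps where Unicode defines a decomposition. The dotless "ı" (U+0131), for example, has none, so strip_accents leaves it unchanged and it still will not compare equal to "i".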

This code may do what you need. It is written as a small application: you can choose between two comparison functions and decide which one suits you.

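import unicodedata

import unidecode  # third-party package: pip install Unidecode

# Pick one of two normalization functions.
if input("Ignore diacritics?\t") in ("y", "yes"):
    # NFKD decomposition, then transliteration to plain ASCII (drops accents).
    def norm_func(s):
        return unidecode.unidecode(unicodedata.normalize("NFKD", s))
else:
    # NFKD decomposition only (diacritics are kept).
    def norm_func(s):
        return unicodedata.normalize("NFKD", s)


def same_character(c1, c2):
    """Return True if both strings are equal after normalization."""
    return norm_func(c1) == norm_func(c2)


# Test for the Greek question mark (U+037E) against ";" (U+003B).
print(same_character("\x3b", "\u037E"))
print(same_character(";", ";"))
# Test for diacritics.
print(same_character("a", "ą"))
# Test for something else.
print(same_character("a", "b"))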

python

unicode