Is there a Python module that provides a test for equivalent Unicode characters?
There are several visually similar Unicode characters, for example:
":" and "꞉" U+A789
"?" and "？" U+FF1F
"*" and "⁎" U+204E
"'", "`", "‘", "’", and "ʻ"
There are also characters with and without diacritics, such as:
"c" and "ç" U+00E7
"E" and "É" U+00C9
"I" and "İ" U+0130
"i" and "ı" U+0131
I want to compare text from different sources so that words that are effectively the same compare equal, for example:
"naive" and "naïve"
"facade" and "façade"
"Hawai'i" and "Hawaiʻi"
"don't" and "don’t"
"letter 'A'" and "letter ‘A’" and "letter `A'"
"letter "B"" and "letter “B”"
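Is there a module that provides an equivalence test between such characters?
import unicode  # hypothetical module; this is the kind of API I am looking for
if x != y and not unicode.same_character(x, y):
    ...  # handle strings that really are different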
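If you transliterate the characters, you will get close to that.
In some of the cases you mention (for example "i" and "ı"), the characters are not actually the same, but transliteration may still give you the result you want.
Keep in mind that transliteration is language-specific.
I suggest reading about the Unicode normalization forms: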
https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
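As a rough illustration of what the normalization forms do, here is a minimal sketch that uses only the standard-library `unicodedata` module. The sample strings are my own choice and do not come from the linked article.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Canonically equivalent strings compare unequal until both are
# normalized to the same form.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The compatibility forms (NFKC/NFKD) additionally fold characters such as
# the fullwidth question mark U+FF1F into their ASCII counterparts.
fullwidth_question = "\uff1f"
print(unicodedata.normalize("NFC", fullwidth_question) == "?")   # False
print(unicodedata.normalize("NFKC", fullwidth_question) == "?")  # True
```

NFC and NFD only handle canonical equivalence, so for look-alike comparison the compatibility forms are usually the more useful starting point.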
This question will also help you with diacritics:
What is the best way to remove accents (normalize) in a Python unicode string?
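One common way to remove accents with the standard library alone is to decompose the string and drop the combining marks. The sketch below assumes that approach; `strip_diacritics` is a hypothetical helper name, not something `unicodedata` provides.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose text (NFD), then drop every combining mark.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("naïve") == "naive")    # True
print(strip_diacritics("façade") == "facade")  # True
# Dotless "ı" (U+0131) has no decomposition, so it is left untouched.
print(strip_diacritics("ı") == "i")            # False
```

The last case shows why a pair such as "i" and "ı" needs transliteration (or an explicit mapping) rather than normalization alone.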
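Maybe this code does what you need. It is written as a small application that lets you choose between two comparison functions, so you can decide which one suits you.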
import unicodedata
import unidecode  # third-party package

# Ask once whether diacritics should be ignored in the comparison.
ignore_diacritics = input("Ignore diacritics?\t") in ("y", "yes")

def norm_func(s):
    # NFKD folds compatibility characters; unidecode then transliterates to ASCII.
    s = unicodedata.normalize("NFKD", s)
    return unidecode.unidecode(s) if ignore_diacritics else s

def same_character(c1, c2):
    return norm_func(c1) == norm_func(c2)

# Test for the Greek question mark.
print(same_character("\x3b", "\u037E"))
print(same_character(";", ";"))
# Test for diacritics.
print(same_character("a", "ą"))
# Test for something else.
print(same_character("a", "b"))
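Characters such as "’" (U+2019) and "ʻ" (U+02BB) have no compatibility decomposition, so NFKD on its own leaves them unchanged. If you would rather not depend on transliteration for those, one option is an explicit mapping table. The sketch below assumes that approach; `LOOKALIKES`, `fold`, and `same_text` are hypothetical names, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical, incomplete table of look-alike punctuation.
# Extend it with whatever characters appear in your own data.
LOOKALIKES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u02bb": "'",   # MODIFIER LETTER TURNED COMMA (Hawaiian okina)
    "`": "'",        # GRAVE ACCENT used as an opening quote
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\ua789": ":",   # MODIFIER LETTER COLON
    "\u204e": "*",   # LOW ASTERISK
})


def fold(text: str) -> str:
    """Apply NFKD, then map the look-alike punctuation to ASCII."""
    return unicodedata.normalize("NFKD", text).translate(LOOKALIKES)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after folding both of them."""
    return fold(a) == fold(b)


print(same_text("Hawai'i", "Hawai\u02bbi"))         # True
print(same_text("don't", "don\u2019t"))             # True
print(same_text("letter 'A'", "letter \u2018A\u2019"))  # True
print(same_text("a", "b"))                          # False
```

This version keeps diacritics distinct, so "naive" and "naïve" still compare unequal unless the combining marks are removed as well.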