How to convert an Arabic character to its base glyph form in Python 3?
Since a single Arabic character can take several glyph forms, each form has its own unicode/utf-8 encoding. For example, for Aleph: Isolated == ا with utf-8 == \xD8\xA7, Final == ـا with utf-8 == \xD9\x80\xD8\xA7, Hamza == أ / إ with utf-8 == \xD8\xA3 / \xD8\xA5, Maddah == آ with utf-8 == \xD8\xA2, Maqsurah == ى with utf-8 == \xD9\x89, where the base form is the isolated aleph with utf-8 == \xD8\xA7.
How can I convert an Arabic character to its base glyph form in Python 3?
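You can use unicodedata.normalize to convert code points to their decomposed form, which consists of a base character followed by a combining modifier. It doesn't work for every case (Maqsurah in particular has no decomposition), but it can help you write a function that determines some of the base forms: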
>>> s = 'ـا'  # this string already consists of base code points (tatweel + alef)
>>> import unicodedata as ud
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
ـ U+0640 ARABIC TATWEEL
ا U+0627 ARABIC LETTER ALEF
>>> s = 'أإآ' # These characters have decomposed forms
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
أ U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
إ U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW
آ U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
>>> s = ud.normalize('NFD',s)
>>> for c in s:
...     print(f'{c} U+{ord(c):04X} {ud.name(c)}')
...
ا U+0627 ARABIC LETTER ALEF
ٔ U+0654 ARABIC HAMZA ABOVE
ا U+0627 ARABIC LETTER ALEF
ٕ U+0655 ARABIC HAMZA BELOW
ا U+0627 ARABIC LETTER ALEF
ٓ U+0653 ARABIC MADDAH ABOVE
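Building on the decomposition above, here is a minimal sketch of such a function. The names to_base_form, MANUAL_BASE and TATWEEL are hypothetical, and mapping Maqsurah to Aleph is an assumption taken from the question's premise, not something Unicode normalization defines. Note that dropping every combining character also removes vowel marks (harakat), which may or may not be what you want.

```python
import unicodedata as ud

# Hypothetical manual overrides for letters that NFD does not decompose.
# Treating alef maksura as alef follows the question's premise; it is an
# assumption, not a Unicode rule.
MANUAL_BASE = {
    '\u0649': '\u0627',  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER ALEF
}
TATWEEL = '\u0640'  # ARABIC TATWEEL, the elongation character


def to_base_form(text):
    """Reduce Arabic letters in text to their base code points (a sketch)."""
    out = []
    for ch in ud.normalize('NFD', text):
        if ch == TATWEEL:
            continue  # drop the elongation character
        if ud.combining(ch):
            continue  # drop combining marks such as HAMZA ABOVE / MADDAH ABOVE
        out.append(MANUAL_BASE.get(ch, ch))
    return ''.join(out)


if __name__ == '__main__':
    for c in to_base_form('\u0623\u0625\u0622\u0649'):
        print(f'{c} U+{ord(c):04X} {ud.name(c)}')
```

Any other letters you consider variants of a base form would have to be added to the manual table yourself.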