Get all Unicode variations of a Latin character
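For example, for the character "a", I want to get a string (a list of characters) like "aàáâãäåāăą" (I'm not sure whether that example list is complete...), i.e. basically every character whose Unicode name is "Latin Small Letter A with *".
Is there a generic way to get this?
I'm asking for Python, but a more general answer is fine too, although I would still appreciate a Python code snippet. Python >=3.5 is fine. I guess you need access to the Unicode database, e.g. through the Python module unicodedata, which I would prefer over other external data sources.
I can imagine a solution like this: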
def get_variations(char):
    import unicodedata
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars
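I don't know of any, but you can build one yourself. Just look up the start and end code points of the special characters; a Unicode character table will give you those. Then use these numbers to build a string of variants for each character: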
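ranges = {
    # (start, end) code points; end is exclusive, as in range()
    'A': (192, 199),
    'B': (0, 0),
    'E': (200, 204),
    # ...
}
# Named variation_map so the built-in map() is not shadowed
variation_map = {}
for char, (start, end) in ranges.items():
    variation_map[char] = char + ''.join(chr(i) for i in range(start, end))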
This produces a mapping like this:
{
    'A': 'AÀÁÂÃÄÅÆ',
    'B': 'B',
    'E': 'EÈÉÊË',
    ...
}
First, get the set of Unicode combining diacritical marks; they're contiguous, so this is pretty easy, e.g.:
# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))
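Now define a function that tries to compose each of them with a base ASCII character; whenever the composed normal form has length 1 (meaning ASCII + combining mark became a single Unicode ordinal), save it: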
import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)
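This has the advantage of not performing manual string lookups in the unicodedata database, and of not needing to hardcode every possible description of the combining characters. Anything that composes to a single character gets included. Runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often the cost is reasonable (if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time, you can decorate it with functools.lru_cache).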
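As a rough sketch of that caching idea (the name cached_variations is made up for illustration, not part of the answer above), the function can be wrapped with functools.lru_cache. This version also drops duplicates, on the assumption that you want each variant once: a few marks in the 768..879 range, such as U+0340 and U+0341, are canonically equivalent to the plain grave and acute, so the uncached loop can yield the same composed character twice.

```python
import functools
import unicodedata

# Combining Diacritical Marks block, U+0300..U+036F (768..879 inclusive)
COMBINING_CHARS = ''.join(map(chr, range(0x300, 0x370)))

@functools.lru_cache(maxsize=None)
def cached_variations(letter):
    """Hypothetical cached, de-duplicated variant of the function above."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    composed = (
        unicodedata.normalize('NFKC', letter + combiner)
        for combiner in COMBINING_CHARS
    )
    # dict.fromkeys() removes duplicates while keeping first-seen order
    singles = dict.fromkeys(c for c in composed if len(c) == 1)
    return ''.join(singles)
```

Repeated calls such as cached_variations('a') then hit the cache instead of re-running the normalization loop.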
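If you want everything that is built out of one of these characters, a more exhaustive search can find it, but it will take longer (functools.lru_cache is nearly mandatory unless it is only called once per argument):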
import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # + 1 so that the last code point, sys.maxunicode itself, is checked too
    for testlet in map(chr, range(sys.maxunicode + 1)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)
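This finds any character that decomposes to a form containing the target letter. It does mean the first search takes roughly a third of a second, and the result includes things that aren't really just a modified version of the character (e.g. the result for 'L' will include ℡, which isn't really a "modified 'L'"), but it's as exhaustive as you can get.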
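If hits like ℡ are unwanted, one possible refinement (my own assumption about what counts as a "variation", not something the search above does) is to keep only code points whose general category is a letter. The helper name letter_variations is hypothetical; ligatures and script letters that contain the target still get through.

```python
import sys
import unicodedata

def letter_variations(letter):
    """Sketch: exhaustive NFKD search restricted to letter categories (L*)."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return ''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        if c != letter
        # Symbols such as the telephone sign have category 'So' and are skipped
        and unicodedata.category(c).startswith('L')
        and letter in unicodedata.normalize('NFKD', c)
    )
```

The category check runs first, so most code points are rejected before the comparatively expensive normalization.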
With unichars:
› unichars -a | grep -i 'Latin Small Letter A with'
à U+000E0 LATIN SMALL LETTER A WITH GRAVE
á U+000E1 LATIN SMALL LETTER A WITH ACUTE
â U+000E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
ã U+000E3 LATIN SMALL LETTER A WITH TILDE
ä U+000E4 LATIN SMALL LETTER A WITH DIAERESIS
å U+000E5 LATIN SMALL LETTER A WITH RING ABOVE
ā U+00101 LATIN SMALL LETTER A WITH MACRON
ă U+00103 LATIN SMALL LETTER A WITH BREVE
ą U+00105 LATIN SMALL LETTER A WITH OGONEK
ǎ U+001CE LATIN SMALL LETTER A WITH CARON
ǟ U+001DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
ǡ U+001E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
ǻ U+001FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
ȁ U+00201 LATIN SMALL LETTER A WITH DOUBLE GRAVE
ȃ U+00203 LATIN SMALL LETTER A WITH INVERTED BREVE
ȧ U+00227 LATIN SMALL LETTER A WITH DOT ABOVE
ᶏ U+01D8F LATIN SMALL LETTER A WITH RETROFLEX HOOK
◌ᷲ U+01DF2 COMBINING LATIN SMALL LETTER A WITH DIAERESIS
ḁ U+01E01 LATIN SMALL LETTER A WITH RING BELOW
ẚ U+01E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
ạ U+01EA1 LATIN SMALL LETTER A WITH DOT BELOW
ả U+01EA3 LATIN SMALL LETTER A WITH HOOK ABOVE
ấ U+01EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ầ U+01EA7 LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
ẩ U+01EA9 LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
ẫ U+01EAB LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
ậ U+01EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
ắ U+01EAF LATIN SMALL LETTER A WITH BREVE AND ACUTE
ằ U+01EB1 LATIN SMALL LETTER A WITH BREVE AND GRAVE
ẳ U+01EB3 LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
ẵ U+01EB5 LATIN SMALL LETTER A WITH BREVE AND TILDE
ặ U+01EB7 LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
ⱥ U+02C65 LATIN SMALL LETTER A WITH STROKE
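Roughly the same name-based filter can be sketched in pure Python with unicodedata, which is what the question asked for. The function name variations_by_name is made up here. Unlike the grep above, it matches on a name prefix, so U+1DF2 (whose name starts with COMBINING) is left out; whether that is desirable is a judgment call.

```python
import sys
import unicodedata

def variations_by_name(char):
    """Return char followed by every code point named '<name of char> WITH ...'."""
    # The trailing space keeps e.g. 'LATIN SMALL LETTER AE WITH ACUTE' out
    prefix = unicodedata.name(char) + " WITH "
    found = [char]
    for code in range(sys.maxunicode + 1):
        candidate = chr(code)
        # name() accepts a default, so unnamed code points give '' instead of raising
        if unicodedata.name(candidate, "").startswith(prefix):
            found.append(candidate)
    return "".join(found)
```

Calling variations_by_name("a") yields a string that starts with "a" and continues with the accented forms in code point order; the exact contents depend on the Unicode version bundled with the interpreter.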
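You can use the decomposition mappings of the Unicode database directly. The following code checks all characters whose decomposition mapping starts with a given letter: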
import unicodedata

def get_unicode_variations(letter):
    letter_code = ord(letter)
    # For some characters, you might want to check all
    # code points up to 0x10FFFF
    for i in range(65536):
        decomp = unicodedata.decomposition(chr(i))
        # Mappings starting with '<...>' indicate a
        # compatibility mapping (NFKD, NFKC) which we ignore.
        while decomp != '' and not decomp.startswith('<'):
            first_code = int(decomp.split()[0], 16)
            if first_code == letter_code:
                print(chr(i), unicodedata.name(chr(i)))
                break
            # Try to decompose further
            decomp = unicodedata.decomposition(chr(first_code))
This is rather inefficient, though, if you want to process multiple characters. For the letter a, the code above prints:
à LATIN SMALL LETTER A WITH GRAVE
á LATIN SMALL LETTER A WITH ACUTE
â LATIN SMALL LETTER A WITH CIRCUMFLEX
ã LATIN SMALL LETTER A WITH TILDE
ä LATIN SMALL LETTER A WITH DIAERESIS
å LATIN SMALL LETTER A WITH RING ABOVE
ā LATIN SMALL LETTER A WITH MACRON
ă LATIN SMALL LETTER A WITH BREVE
ą LATIN SMALL LETTER A WITH OGONEK
ǎ LATIN SMALL LETTER A WITH CARON
ǟ LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
ǡ LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
ǻ LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
ȁ LATIN SMALL LETTER A WITH DOUBLE GRAVE
ȃ LATIN SMALL LETTER A WITH INVERTED BREVE
ȧ LATIN SMALL LETTER A WITH DOT ABOVE
ḁ LATIN SMALL LETTER A WITH RING BELOW
ạ LATIN SMALL LETTER A WITH DOT BELOW
ả LATIN SMALL LETTER A WITH HOOK ABOVE
ấ LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ầ LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
ẩ LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
ẫ LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
ậ LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
ắ LATIN SMALL LETTER A WITH BREVE AND ACUTE
ằ LATIN SMALL LETTER A WITH BREVE AND GRAVE
ẳ LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
ẵ LATIN SMALL LETTER A WITH BREVE AND TILDE
ặ LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
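To address the inefficiency for multiple characters, one option is to walk the code points once and build a lookup table, so that each later query is a dictionary access. This is only a sketch under the same rules as the code above (canonical mappings only, first code point of each mapping); build_variation_index and its limit parameter are names invented for this example.

```python
import collections
import unicodedata

def build_variation_index(limit=0x10000):
    """Map each base character to the characters whose canonical
    decomposition chain starts with it."""
    index = collections.defaultdict(list)
    for code in range(limit):
        char = chr(code)
        decomp = unicodedata.decomposition(char)
        # '<...>' marks a compatibility mapping, which is ignored here
        while decomp and not decomp.startswith('<'):
            base = chr(int(decomp.split()[0], 16))
            # Register char under every base met along the chain,
            # e.g. 'ǟ' ends up under both 'ä' and 'a'
            index[base].append(char)
            decomp = unicodedata.decomposition(base)
    return index

index = build_variation_index()
print(''.join(index['a']))
print(''.join(index['e']))
```

Pass limit=0x110000 to cover all planes, at the cost of a longer one-time scan.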