Unicode table information about a character in Python
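Is there a way in Python to get the technical information for a given character, as it is displayed in the Unicode table? (cf. https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
- Name > LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
- Unicode number > U+0204
- HTML-code > &#516;
- Block > Latin Extended-B
- Lowercase > ȅ
What I actually need is to get, for any Unicode number (like U+0204 here), the corresponding name (LATIN CAPITAL LETTER E WITH DOUBLE GRAVE) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = the corresponding information
The closest thing I have been able to find is the fontTools library, but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.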
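You can do this in a few ways:
1- Create an API yourself (I couldn't find anything that already does this)
2- Create a table in a database or an Excel file
3- Load and parse a website to do this
I think the third way is the easiest. Take a look at This Page. You can find some information about Unicode characters there.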
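The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available, but another Stack Overflow question has a solution of sorts, and another has some additional approaches using regex.
The uppercase/lowercase mapping and the character number are not specific to unicodedata; just use the regular Python string functions.
So, in summary: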
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'
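The U+%04X format is sort of correct: it pads to at least four hex digits, and for code points with a value above 65,535 it simply prints the whole hex number without further padding. Note that some other formats require %08X padding in that case (notably the \U00010000 escape format in Python).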
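As a rough sketch of the "input = a Unicode number, output = the corresponding information" request, the pieces above can be wrapped in one function. The name describe and the dictionary keys are my own choices for illustration, and this assumes Python 3 (where chr accepts any code point).

```python
import unicodedata

def describe(codepoint):
    """Collect basic information for an integer code point (illustrative helper)."""
    char = chr(codepoint)
    return {
        # name() returns the default (None here) for code points without a name
        'name': unicodedata.name(char, None),
        # at least four hex digits; longer code points print in full
        'number': 'U+%04X' % codepoint,
        # numeric HTML character reference
        'html': '&#%d;' % codepoint,
        'lowercase': char.lower(),
        # the Python \U escape always needs eight hex digits
        'escape': '\\U%08X' % codepoint,
    }

info = describe(0x0204)
print(info['name'])    # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(info['number'])  # U+0204
print(info['html'])    # &#516;
```

The block name is deliberately missing here, since unicodedata does not expose it.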
The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately UnicodeData.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 elements, separated by ;, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I have taken the names of the class elements from that list; the meaning of each element is explained on the same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them in the same folder as this program.
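To make the 15-element layout concrete, here is a minimal sketch that splits a single line into named fields. The field names in FIELDS and the function parse_line are my own labels, and SAMPLE is a line written in the UnicodeData.txt format and hard-coded for illustration, so check it against your downloaded file.

```python
# Illustrative field labels, in file order (not official property names)
FIELDS = ('code', 'name', 'category', 'combining', 'bidirectional',
          'decomposition', 'decimal', 'digit', 'numeric', 'mirrored',
          'unicode1_name', 'comment', 'uppercase', 'lowercase', 'titlecase')

def parse_line(line):
    """Split one ';'-separated line into a dict keyed by FIELDS."""
    values = line.rstrip('\r\n').split(';')
    if len(values) != len(FIELDS):
        raise ValueError('expected %d fields, got %d' % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    # code points are stored as hexadecimal text
    record['code'] = int(record['code'], 16)
    for key in ('uppercase', 'lowercase', 'titlecase'):
        record[key] = int(record[key], 16) if record[key] else None
    return record

# Sample line in the UnicodeData.txt format (assumed, hard-coded for the demo)
SAMPLE = '0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;0045 030F;;;;N;;;;0205;'
record = parse_line(SAMPLE)
print(record['name'])                # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print('%04X' % record['lowercase'])  # 0205
```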
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-
class UnicodeCharacter:
    """Holds the fields of one line of UnicodeData.txt, plus its block."""
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None
    def __getitem__(self, item):
        # allow character['name'] as well as character.name
        return getattr(self, item)
    def __repr__(self):
        return '{'+self.name+'}'
class UnicodeBlock:
    """A named, inclusive range of code points from Blocks.txt."""
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'
    def __repr__(self):
        return '{'+self.name+'}'
class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                # drop comments
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    # format: "0180..024F; Latin Extended-B"
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
        # make 100% sure it's sorted, for quicker look-up later
        # (it is usually sorted in the file, but better make sure)
        self.blocklist.sort(key=lambda x: x.first)
    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None
class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
    These files must appear in the same directory as this program.
    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.
    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).
    Initialize with
        uclist = UnicodeList()
    """
    def __init__(self):
        # we need this first
        blocklist = BlockList()
        bpos = 0
        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #  ONE QUARTER ... 1/4
                    # divide as floats so Python 2.7 does not truncate
                    if '/' in rawdata[8]:
                        numerator, denominator = rawdata[8].split('/')
                        parsed.asNumeric = float(numerator) / float(denominator)
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    # both lists are sorted, so only ever move forward through the blocks
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)
    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).
        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None
    def find_char(self,chars):
        """Find the Unicode information for a codepoint (as character).
        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(chars) > 1:
            result = [self.find_code(ord(x)) for x in chars]
            return result
        else:
            return self.find_code(ord(chars))
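The comment in BlockList mentions sorting "for quicker look-up later", while lookup itself still scans linearly. One possible way to use the sort order is a binary search with the standard bisect module. This is only a sketch: load_blocks, block_name and the hard-coded SAMPLE_BLOCKS excerpt (written in the Blocks.txt format) are assumptions for the demo, not part of the code above.

```python
import bisect

# Small excerpt in the Blocks.txt format, hard-coded for illustration
SAMPLE_BLOCKS = """\
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
"""

def load_blocks(text):
    """Return a sorted list of (first, last, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split('#')[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(';')
        first, last = span.split('..')
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    blocks.sort()
    return blocks

def block_name(blocks, starts, code):
    """Binary-search the block whose range contains code, or return None."""
    index = bisect.bisect_right(starts, code) - 1
    if index >= 0 and blocks[index][0] <= code <= blocks[index][1]:
        return blocks[index][2]
    return None

blocks = load_blocks(SAMPLE_BLOCKS)
starts = [first for first, _, _ in blocks]  # parallel list of range starts
print(block_name(blocks, starts, 0x204))    # Latin Extended-B
```

With only a few hundred blocks the linear scan is fine in practice; the binary search mostly matters if you look up blocks for many code points.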
Once loaded, you can look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
By default this prints the name of the character (the number you pass in is what Unicode calls a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured of getting the most recent information:
>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(Tested with Python 3.5.3.)
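If you want to see how old the table frozen into your interpreter is, the standard module exposes the version of the Unicode database it was built with. A small sketch, assuming Python 3; builtin_name is a helper name I made up, and passing a default to unicodedata.name avoids the ValueError shown above.

```python
import unicodedata

# Version of the Unicode Character Database compiled into this Python build
print(unicodedata.unidata_version)

def builtin_name(codepoint):
    """Return the built-in name for a code point, or None if this build has none."""
    return unicodedata.name(chr(codepoint), None)

# Prints None on builds whose table predates this character, its name otherwise
print(builtin_name(0x1F903))
```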
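Currently two lookup functions are defined:
find_code(int)
looks up character information by code point, given as an integer.
find_char(string)
looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.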
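Note that find_code walks the whole list on every call. If you do many lookups, one option is to build a dictionary keyed by code point once and reuse it. The sketch below uses a tiny stand-in class (Record) and a made-up helper (build_index) so that it runs on its own; with the real module you would feed it the codelist instead.

```python
class Record:
    """Hypothetical stand-in for UnicodeCharacter with only two fields."""
    def __init__(self, code, name):
        self.code = code
        self.name = name

def build_index(records):
    """Map code point -> record, keeping the first record on duplicates."""
    index = {}
    for record in records:
        index.setdefault(record.code, record)
    return index

records = [
    Record(0x204, 'LATIN CAPITAL LETTER E WITH DOUBLE GRAVE'),
    Record(0x205, 'LATIN SMALL LETTER E WITH DOUBLE GRAVE'),
]
index = build_index(records)
print(index[0x204].name)   # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(index.get(0x999))    # None, same convention as find_code
```

Keeping the first record on duplicates mirrors what find_code does when it returns val[0].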
After from unicodelist import UnicodeList (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't really need all of this if all you want is the hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
text = 'Héllo...'
dest = ''
for i in text:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character, following your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04X' % ul.find_char(letter).code)
print ('Block > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
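For the HTML part, a numeric character reference can always be built from the code point, and Python 3 ships a table of the HTML 4 named entities in html.entities. A minimal sketch, where html_code is my own helper name:

```python
from html.entities import codepoint2name

def html_code(char):
    """Return a named HTML 4 entity if one exists, else a numeric reference."""
    code = ord(char)
    name = codepoint2name.get(code)
    return '&%s;' % name if name else '&#%d;' % code

print(html_code('\u00e8'))  # &egrave;
print(html_code('\u0204'))  # &#516;
```

Only a few hundred characters have a named entity in that table, so most characters fall back to the numeric form.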