How to "normalize" a Python 3 Unicode string
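I need to compare two strings. aa was extracted from a PDF file (using pdfminer/chardet), and bb is keyboard input. How can I normalize the first string so that the two can be compared?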
>>> aa = "ā"
>>> bb = "ā"
>>> aa == bb
False
>>>
>>> aa.encode('utf-8')
b'\xc4\x81'
>>> bb.encode('utf-8')
b'a\xcc\x84'
Use unicodedata.normalize to bring both strings into the same normalization form before comparing:
>>> aa = b'\xc4\x81'.decode('utf8') # composed form
>>> bb = b'a\xcc\x84'.decode('utf8') # decomposed form
>>> aa
'ā'
>>> bb
'ā'
>>> aa == bb
False
>>> import unicodedata as ud
>>> aa == ud.normalize('NFC',bb) # compare composed
True
>>> ud.normalize('NFD',aa) == bb # compare decomposed
True
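If the comparison is needed in more than one place, it can be wrapped in a small helper. The following is only a sketch: the function name same_text and its form parameter are made up for illustration and are not part of unicodedata; the only library call used is unicodedata.normalize. It normalizes both sides so that it does not matter which string came from the PDF and which from the keyboard.

```python
import unicodedata


def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both.

    `form` is passed straight to unicodedata.normalize and may be
    'NFC', 'NFD', 'NFKC' or 'NFKD'.
    """
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


composed = "\u0101"     # 'ā' as a single code point (LATIN SMALL LETTER A WITH MACRON)
decomposed = "a\u0304"  # 'a' followed by U+0304 COMBINING MACRON

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: identical after NFC normalization
```

Text extracted from PDFs sometimes also contains compatibility characters such as the ligature 'ﬁ' (U+FB01). NFC leaves those untouched, so passing form="NFKC" may be the better choice if such characters should compare equal to their plain spellings.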