How to find integer representing code point of special character? TypeError: ord() expected a character, but string of length 2 found
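I want to compute the integer that represents the code point of a few national characters in several different encodings (I am sure all of these codecs contain these characters). My program looks like this: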
characters = ['Č', 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']
for letter in characters:
    for code in codecs:
        print(letter + ' ' + code + ' ' + str(ord(letter.encode(code))))
Output:
Č iso8859_2 200
Č cp1250 200
Traceback (most recent call last):
  File "C:/Users/Miha/Documents/2Semester/IK/Vaja2/chrEncode.py", line 7, in <module>
    print(letter + ' ' + code + ' ' + str(ord(letter.encode(code))))
TypeError: ord() expected a character, but string of length 2 found
Č mac_latin2 137
The following commented code snippet may help:
characters = ['Č'] #, 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']
for letter in characters:
    for code in codecs:
        charenc = letter.encode(code)
        if len(charenc) == 1:
            # single-byte encoding: show the byte value in decimal
            charcod = str(charenc[0])
        else:
            # multi-byte encoding: show all bytes as one hex string
            charcod = '0x' + ''.join('{:02X}'.format(b) for b in charenc)
        print(letter +
              ' U+' + '{:04X}'.format(ord(letter)) +  # Unicode code point in hex
              ' (=' + str(ord(letter)) +              # ditto in decimal
              '), length=' + str(len(charenc)) +      # number of encoded bytes
              ' ' + charcod +                         # encoded value
              ' in ' + code)                          # encoding
Output:
D:\test\Python> python 37191263.py
Č U+010C (=268), length=1 200 in iso8859_2
Č U+010C (=268), length=1 200 in cp1250
Č U+010C (=268), length=1 137 in mac_latin2
Č U+010C (=268), length=2 0xC48C in utf-8
Č U+010C (=268), length=2 0x0C01 in utf_16_le
Č U+010C (=268), length=2 0x010C in utf_16_be
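Here all utf-8, utf_16_le and utf_16_be converted values are printed in hexadecimal. Converting them to decimal would be no problem, although IMHO decimal values seem to be of little use. Conversely, I would print the other (single-byte) values in hexadecimal as well.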
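If the decimal form is wanted anyway, here is a minimal sketch of one way to get it, assuming Python 3.5 or later (for `bytes.hex()`). The helper name `encoded_bytes_as_decimal` is made up for illustration and is not part of the script above.

```python
def encoded_bytes_as_decimal(letter, codec):
    """Encode `letter`, then read the hex string of its bytes as one integer.

    Hypothetical helper: it mirrors the '0x...' value printed above,
    just converted to decimal.
    """
    hex_text = letter.encode(codec).hex()  # e.g. 'c48c' for 'Č' in utf-8
    return int(hex_text, 16)


letter = 'Č'
for codec in ['utf-8', 'utf_16_le', 'utf_16_be']:
    hex_text = letter.encode(codec).hex().upper()
    print(letter, codec, '0x' + hex_text, encoded_bytes_as_decimal(letter, codec))
```

Note that for the multi-byte encodings this number is just the encoded bytes read left to right; it is not necessarily the Unicode code point.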
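Sorry if my adaptation of your script seems puerile.
This is my first Python session: I only installed Python and tried it out because of your question... Thank you for inspiring a new experience!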
I found that the class method int.from_bytes(bytes, byteorder, *, signed=False) works in place of ord().
Code:
characters = ['Č', 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['cp852', 'iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']
for letter in characters:
    for codec in codecs:
        # encoded bytes read as a single big-endian integer
        decCodePoint = int.from_bytes(letter.encode(codec), byteorder='big')
        # also show the integer in hexadecimal and octal
        print(letter, codec, decCodePoint, hex(decCodePoint), oct(decCodePoint))
Output for 'Č' only:
Č cp852 172 0xac 0o254
Č iso8859_2 200 0xc8 0o310
Č cp1250 200 0xc8 0o310
Č mac_latin2 137 0x89 0o211
Č utf-8 50316 0xc48c 0o142214
Č utf_16_le 3073 0xc01 0o6001
Č utf_16_be 268 0x10c 0o414
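One thing worth noting about these numbers: for multi-byte encodings the result depends on the `byteorder` argument, and only some combinations coincide with the Unicode code point returned by `ord(letter)`. A small sketch of this, assuming Python 3; `encoded_value` is a hypothetical helper name, not something from the code above:

```python
def encoded_value(letter, codec, byteorder='big'):
    """Read the encoded bytes of `letter` as one unsigned integer."""
    return int.from_bytes(letter.encode(codec), byteorder=byteorder)


letter = 'Č'
print(ord(letter))                                   # Unicode code point: 268
print(encoded_value(letter, 'utf_16_be', 'big'))     # 268
print(encoded_value(letter, 'utf_16_le', 'little'))  # 268
print(encoded_value(letter, 'utf_16_le', 'big'))     # 3073, bytes read in the wrong order
print(encoded_value(letter, 'utf-8', 'big'))         # 50316, the UTF-8 bytes, not the code point
```

So matching `byteorder` to the codec recovers 268 for both UTF-16 variants, while the UTF-8 value is only the two encoded bytes read as an integer.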