How to find integer representing code point of special character? TypeError: ord() expected a character, but string of length 2 found

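I want to compute the integer values that represent several national characters in different encodings (I'm sure all of these codecs contain these characters). My program looks like this: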

characters = ['Č', 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']

for letter in characters:
    for code in codecs:
        print(letter + ' ' + code + ' ' + str(ord(letter.encode(code))))

Output:

Č iso8859_2 200
Č cp1250 200
Traceback (most recent call last):
  File "C:/Users/Miha/Documents/2Semester/IK/Vaja2/chrEncode.py", line 7, in <module>
    print(letter + ' ' + code + ' ' + str(ord(letter.encode(code))))
TypeError: ord() expected a character, but string of length 2 found
Č mac_latin2 137
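
As a minimal sketch of why the loop stops at utf-8: str.encode() returns a bytes object, and ord() accepts only a single character or a single byte. The variable names lengths and raised are illustrative only and are not part of the original script.

```python
# str.encode() returns bytes; ord() accepts only a length-1 str or bytes object.
letter = 'Č'
codecs = ['iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']

# Number of bytes each codec produces for the same character.
lengths = {codec: len(letter.encode(codec)) for codec in codecs}
print(lengths)

# ord() works for the single-byte codecs but refuses a two-byte result.
try:
    ord(letter.encode('utf-8'))
    raised = False
except TypeError:
    raised = True
print(raised)
```

The single-byte codecs give a length of 1, so ord() succeeds for them; utf-8 and both UTF-16 variants give 2 bytes for this character.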

The following commented code snippet may help:

characters = ['Č'] #, 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']

for letter in characters:
    for code in codecs:
        charenc = letter.encode(code)
        if len(charenc) == 1:
            charcod = str(ord(charenc))
        else:
            charcod = '0x' + ''.join('{:02X}'.format(byte) for byte in charenc)
        print(  letter       + 
                ' U+'        + '{:04X}'.format(ord(letter)) + # Unicode code point in hex
                ' (='        + str(ord(letter))             + # ditto in decimal
                '), length=' + str(len(charenc))            + # number of encoded bytes
                ' '          + charcod                      + # encoded value
                ' in '       + code                         + # encoding name
                '')

Output:

D:\test\Python> python 37191263.py
Č U+010C (=268), length=1 200 in iso8859_2
Č U+010C (=268), length=1 200 in cp1250
Č U+010C (=268), length=1 137 in mac_latin2
Č U+010C (=268), length=2 0xC48C in utf-8
Č U+010C (=268), length=2 0x0C01 in utf_16_le
Č U+010C (=268), length=2 0x010C in utf_16_be

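Here the values encoded with utf-8, utf_16_le, and utf_16_be are all printed in hexadecimal. Converting them to decimal would be a trivial task, although IMHO decimal does not seem useful for them. On the contrary, I would print all the other cases in hexadecimal as well.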
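
For anyone who does want the decimal form, a small sketch of that conversion using the built-in int() with base 16. The dictionary names hex_values and decimal_values are illustrative; the hex strings are copied from the output above.

```python
# Hex strings for 'Č' as printed in the output above.
hex_values = {'utf-8': '0xC48C', 'utf_16_le': '0x0C01', 'utf_16_be': '0x010C'}

# int(text, 16) parses a hexadecimal string; the '0x' prefix is accepted.
decimal_values = {codec: int(text, 16) for codec, text in hex_values.items()}

for codec, value in decimal_values.items():
    print(codec, value)
```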

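Sorry if my adaptation of your script looks trivial.
This was my first Python session: I installed Python and tried it only because of your question... Thanks for inspiring a new experience!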

I found that the class method int.from_bytes(bytes, byteorder, *, signed=False) can be used instead of ord(). Code:

characters = ['Č', 'č', 'Š', 'š', 'Ž', 'ž']
codecs = ['cp852', 'iso8859_2', 'cp1250', 'mac_latin2', 'utf-8', 'utf_16_le', 'utf_16_be']

for letter in characters:
    for codec in codecs:
        # Integer value of the encoded bytes, read as a big-endian number
        decCodePoint = int.from_bytes(letter.encode(codec), byteorder='big')
        # Print the value in decimal, hexadecimal, and octal
        print(letter + ' ' + codec + ' ' + str(decCodePoint) + ' ' + hex(decCodePoint) + ' ' + oct(decCodePoint))

Output for 'Č' only:

Č cp852 172 0xac 0o254
Č iso8859_2 200 0xc8 0o310
Č cp1250 200 0xc8 0o310
Č mac_latin2 137 0x89 0o211
Č utf-8 50316 0xc48c 0o142214
Č utf_16_le 3073 0xc01 0o6001
Č utf_16_be 268 0x10c 0o414
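
One observation on the utf_16_le row: because the bytes are always read with byteorder='big', the little-endian encoding yields 3073 rather than the Unicode code point 268. The following sketch, with illustrative variable names, shows that reading the same bytes with byteorder='little' gives the value that ord() returns for the character itself. Whether that is the number you want depends on what you mean by "the integer for this encoding".

```python
letter = 'Č'
le_bytes = letter.encode('utf_16_le')   # the two bytes 0x0C, 0x01

# Same bytes, interpreted with each byte order.
as_big = int.from_bytes(le_bytes, byteorder='big')
as_little = int.from_bytes(le_bytes, byteorder='little')

print(as_big, as_little, ord(letter))
```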