How do I iterate over UTF-8 in Python?

Question

How do I iterate over UTF-8?

import string

for character in string.printable[1:]:
    print(character)

Presumably there's a similar approach for UTF-8?

Answer 1

Presumably there's a similar approach for UTF-8?

Do you want to know which code points outside the ASCII range are printable? Or do you want the UTF-8 encoding of the printable characters?

To get every printable code point in all of Unicode:

# 0x10ffff is the highest Unicode code point (the same value as sys.maxunicode)
unicode_max = 0x10ffff
printable_glyphs = [chr(x) for x in range(unicode_max + 1) if chr(x).isprintable()]
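The list comprehension above builds the whole list in memory at once. As a minimal sketch, a generator can produce the same characters lazily and restrict the range; the name `iter_printable` and its parameters are hypothetical, chosen here only for illustration:

```python
import sys

def iter_printable(start=0, stop=sys.maxunicode + 1):
    """Yield each printable character whose code point is in [start, stop)."""
    for code_point in range(start, stop):
        character = chr(code_point)
        if character.isprintable():
            yield character

# Restrict the range to ASCII (code points 0 through 127) as a quick check.
ascii_printable = list(iter_printable(0, 128))
print("".join(ascii_printable))
```

Note that `str.isprintable()` treats the ASCII space as printable but not tab or newline, so this differs slightly from `string.printable`, which includes those whitespace characters.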

As noted above, UTF-8 is an encoding: it maps text to specific bytes so that other programs can share the data.

Text in memory is not UTF-8. Each character/glyph has a code point.
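A short sketch of that distinction: iterating over a `str` walks code points, while iterating over the encoded `bytes` walks individual byte values. The sample text is an arbitrary choice for illustration.

```python
text = "h\u00e9llo"  # "héllo"; the escape avoids depending on the source file's encoding

# Iterating over a str yields one-character strings, one per code point.
code_points = [ord(character) for character in text]

# Encoding produces a bytes object; iterating over it yields integers 0-255.
encoded = text.encode("utf-8")
byte_values = list(encoded)

print(code_points)  # 5 code points
print(byte_values)  # 6 bytes, because U+00E9 takes two bytes in UTF-8
```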

Converting to UTF-8

import unicodedata
monkey = unicodedata.lookup('monkey')

print(f"""
    glyph: {monkey}
    codepoint: Dec: {ord(monkey)}
    codepoint: Hex:  {hex(ord(monkey))}

    utf8: { monkey.encode('utf8', errors='strict') }
    utf16: { monkey.encode('utf16', errors='strict') }
    utf32: { monkey.encode('utf32', errors='strict') }
""")

Output:

glyph: 🐒
codepoint: Dec: 128018
codepoint: Hex:  0x1f412

utf8: b'\xf0\x9f\x90\x92'
utf16: b'\xff\xfe=\xd8\x12\xdc'
utf32: b'\xff\xfe\x00\x00\x12\xf4\x01\x00'
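If the starting point is UTF-8 bytes rather than a `str`, one way to iterate over it character by character is to decode first and then re-encode each character. This is only a sketch; `iter_utf8` is a hypothetical helper name, and it assumes the whole input is available at once and is valid UTF-8 (otherwise `decode` raises `UnicodeDecodeError`).

```python
def iter_utf8(data: bytes):
    """Yield (character, utf8_bytes) pairs for UTF-8 encoded data."""
    for character in data.decode("utf-8", errors="strict"):
        yield character, character.encode("utf-8")

# Sample input: 'a' (1 byte), U+00E9 (2 bytes), and the monkey, U+1F412 (4 bytes).
sample = "a\u00e9\U0001f412".encode("utf-8")
pairs = list(iter_utf8(sample))

for character, raw in pairs:
    print(f"U+{ord(character):04X} {raw!r}")
```

Each character's UTF-8 form is between one and four bytes long, which is why iterating over the raw bytes directly does not line up with characters.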
