Read utf-8 character from byte stream

Question

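Given a byte stream (a generator, a file, etc.), how can I read a single utf-8 encoded character?

This operation must consume the bytes of that character from the stream.
This operation must not consume any bytes of the stream beyond the first character.
This operation should succeed on any Unicode character.

I could solve this by rolling my own utf-8 decoding function, but I would prefer not to reinvent the wheel, since I'm sure this functionality must already be used elsewhere to parse utf-8 strings.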

Answer 1

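Wrap the stream in a TextIOWrapper with encoding='utf8', then call .read(1) on it.

This assumes you are starting with a BufferedIOBase or something duck-type compatible with it (i.e. it has a read() method). If you have a generator or iterator, you may need to adapt the interface.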
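As a rough illustration of adapting the interface, here is a minimal sketch that exposes an iterator of bytes chunks as a readable stream. The class name IterStream and the byte_chunks generator are hypothetical, made up for this example. Note that the wrapper pulls whole chunks from the iterator, so it reads ahead of the character being returned.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: presents an iterator of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill from the iterator, skipping any empty chunks.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals EOF
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def byte_chunks():
    # Example data only: two chunks of utf-8 encoded text.
    yield "h€".encode("utf-8")
    yield "llo".encode("utf-8")


text = io.TextIOWrapper(io.BufferedReader(IterStream(byte_chunks())), encoding="utf-8")
first = text.read(1)   # one character, decoded from one byte
second = text.read(1)  # one character, decoded from three bytes
```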

Example:

from io import TextIOWrapper

with open('/path/to/file', 'rb') as f:
    wf = TextIOWrapper(f, encoding='utf-8')
    # Make the wrapper read one byte at a time so it does not read ahead.
    wf._CHUNK_SIZE = 1  # Implementation detail, may not work everywhere

    wf.read(1)  # returns the next character, decoded from utf-8
    f.read(1)   # returns the next byte
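If relying on _CHUNK_SIZE is a concern, one possible alternative (not part of the answer above) is to feed the stream one byte at a time into an incremental decoder from the standard codecs module. The helper name read_utf8_char is hypothetical. This sketch assumes the stream has a read() method that returns bytes.

```python
import codecs
import io


def read_utf8_char(stream):
    """Read one utf-8 character from a binary stream; return '' at EOF."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # final=True raises UnicodeDecodeError if the stream ended mid-character.
            return decoder.decode(b"", final=True)
        char = decoder.decode(byte)
        if char:
            # The decoder only emits output once the full sequence has arrived.
            return char


stream = io.BytesIO("€a".encode("utf-8"))
first = read_utf8_char(stream)  # consumes the three bytes of the euro sign
rest = stream.read()            # the remaining bytes are untouched
```

Because only one byte is requested per call to read(), nothing past the first character is consumed from the stream.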

utf-8

utf8-decode

python-3.x