Achieving consistent block sizing in python raw file IO

Question

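Question up front:

Is there a pythonic way within the standard library to parse raw binary files using for ... in ... syntax (i.e., __iter__/__next__) that yields blocks respecting the buffersize parameter, without having to subclass IOBase or its child classes?

Detailed explanation

I'd like to open a raw file for parsing, using the for ... in ... syntax, and I'd like that syntax to yield predictably shaped objects. This wasn't happening as expected for a problem I was working on, so I tried the following test (requires import numpy as np):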

In [271]: with open('tinytest.dat', 'wb') as f:
     ...:     f.write(np.random.randint(0, 256, 16384, dtype=np.uint8).tobytes())
     ...:

In [272]: np.array([len(b) for b in open('tinytest.dat', 'rb', 16)])
Out[272]:
array([  13,  138,  196,  263,  719,   98,  476,    3,  266,   63,   51,
    241,  472,   75,  120,  137,   14,  342,  148,  399,  366,  360,
     41,    9,  141,  282,    7,  159,  341,  355,  470,  427,  214,
     42, 1095,   84,  284,  366,  117,  187,  188,   54,  611,  246,
    743,  194,   11,   38,  196, 1368,    4,   21,  442,  169,   22,
    207,  226,  227,  193,  677,  174,  110,  273,   52,  357])

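I couldn't understand why this seemingly random behavior was arising, or why it wasn't respecting the buffersize argument. Using read1 gave the expected number of bytes: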

In [273]: with open('tinytest.dat', 'rb', 16) as f:
     ...:     b = f.read1()
     ...:     print(len(b))
     ...:     print(b)
     ...:
16
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'

And there it is: a newline near the end of the first block.

In [274]: with open('tinytest.dat', 'rb', 2048) as f:
     ...:     print(f.readline())
     ...:
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'

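Sure enough, readline was being called to produce each block of the file, and it was tripping up on the newline value (the byte 10). I verified this reading through the lines of code in the definition of IOBase: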

def __next__(self):
    line = self.readline()
    if not line:
        raise StopIteration
    return line

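So my question is this: is there a more pythonic way to get buffersize-respecting raw file behavior that allows the for ... in ... syntax, without having to subclass IOBase or its child classes (and thus ending up with something that is not part of the standard library)? If not, does this unexpected behavior warrant a PEP? (Or does it warrant learning to expect the behavior? :)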

Answer 1

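This behavior isn't unexpected: it is documented that all objects derived from IOBase iterate over lines. The only difference between binary mode and text mode is how the line terminator is defined, and in binary mode it is always b"\n".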

From the docs:

IOBase (and its subclasses) supports the iterator protocol, meaning that an IOBase object can be iterated over yielding the lines in a stream. Lines are defined slightly differently depending on whether the stream is a binary stream (yielding bytes), or a text stream (yielding character strings). See readline() below.
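
To make the quoted behavior concrete, here is a small sketch (the sample bytes and the temporary file are made up for illustration). It writes ten bytes containing three newline bytes and iterates over the file opened with a 4-byte buffer. The chunk lengths follow the positions of the byte 10, not the buffer size:

```python
import os
import tempfile

# Illustrative data: the byte 10 (b"\n") appears at offsets 2, 6 and 7
data = bytes([77, 251, 10, 1, 2, 3, 10, 10, 9, 8])

# Write the data to a temporary file so we can reopen it with a buffer size
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # The third argument is the buffering size; iteration still splits on b"\n"
    with open(path, "rb", 4) as f:
        line_lengths = [len(chunk) for chunk in f]
finally:
    os.remove(path)

print(line_lengths)  # [3, 4, 1, 2]
```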

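The problem is that there has historically been ambiguity between text and binary data in the type system, which was a major motivating factor for the Python 2 -> 3 transition breaking backwards compatibility.

I think it would certainly be reasonable for the iterator protocol to respect the buffer size for file objects opened in binary mode in Python 3. Why it was decided to keep the old behavior is something I can only speculate about.

Regardless, you should just define your own iterator, which is common in Python. Iterators are a basic building block, just like the built-in types.

You can actually use the 2-argument iter(callable, sentinel) form to construct a super basic wrapper: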

>>> from functools import partial
>>> def iter_blocks(f, n):
...     return iter(partial(f.read, n), b'')
...
>>> np.array([len(b) for b in iter_blocks(open('tinytest.dat', 'rb'), 16)])
array([16, 16, 16, ..., 16, 16, 16])
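
One detail of this wrapper worth spelling out: iter keeps calling f.read(n) until the result equals the sentinel b'', so when the file size is not a multiple of n the last block is simply shorter. A small self-contained sketch, using an in-memory io.BytesIO stream as a stand-in for the file (the 40-byte payload is an arbitrary choice):

```python
import io
from functools import partial

def iter_blocks(f, n):
    # Call f.read(n) repeatedly; stop when it returns the sentinel b'' (end of file)
    return iter(partial(f.read, n), b'')

# 40 bytes, deliberately including the newline byte 10, which no longer matters
stream = io.BytesIO(bytes(range(40)))
sizes = [len(block) for block in iter_blocks(stream, 16)]
print(sizes)  # [16, 16, 8]
```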

Of course, you could just use a generator:

def iter_blocks(bin_file, n):
    result = bin_file.read(n)
    while result:
        yield result
        result = bin_file.read(n)

There are many ways to approach this. Again, iterators are a core type for writing idiomatic Python.
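
As one more variation, here is a sketch for the case where the stream is unbuffered (for example opened with buffering=0). A raw stream's read(n) is allowed to return fewer than n bytes, so this version keeps reading until a block is full. The names iter_full_blocks and ShortReader are hypothetical; ShortReader exists only to simulate short reads, and the sketch assumes a blocking stream:

```python
import io

def iter_full_blocks(stream, n):
    """Yield blocks of exactly n bytes; only the final block may be shorter."""
    while True:
        parts = []
        remaining = n
        while remaining > 0:
            piece = stream.read(remaining)
            if not piece:  # b'' signals end of stream (blocking stream assumed)
                break
            parts.append(piece)
            remaining -= len(piece)
        if not parts:
            return
        yield b"".join(parts)
        if remaining > 0:  # the stream ended inside this block
            return

class ShortReader(io.RawIOBase):
    """Hypothetical raw stream that never returns more than max_piece bytes."""
    def __init__(self, data, max_piece):
        self._data = data
        self._pos = 0
        self._max_piece = max_piece

    def readable(self):
        return True

    def readinto(self, buf):
        # Copy at most max_piece bytes into the caller's buffer
        limit = min(len(buf), self._max_piece)
        piece = self._data[self._pos:self._pos + limit]
        buf[:len(piece)] = piece
        self._pos += len(piece)
        return len(piece)

raw = ShortReader(bytes(range(40)), max_piece=5)
sizes = [len(block) for block in iter_full_blocks(raw, 16)]
print(sizes)  # [16, 16, 8]
```

For an ordinary buffered file opened with open(path, 'rb'), read(n) already returns n bytes unless the end of the file is reached, so the simpler versions are enough there.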

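Python is a pretty dynamic language, and "duck typing" is the name of the game. Generally, your first instinct shouldn't be "how do I subclass some built-in type to extend functionality". I mean, that is often possible, but you'll find that a lot of language features are geared towards not having to do that, and it is often better expressed that way to begin with; at least, that is generally my opinion.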
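
To illustrate the duck-typing point, here is a minimal sketch of a wrapper that supports for ... in ... without subclassing IOBase. The class name BlockFile is hypothetical and not part of the standard library. It only implements the iterator protocol and delegates everything else to the wrapped stream:

```python
import io

class BlockFile:
    """Hypothetical wrapper: iterating yields fixed-size blocks instead of lines."""

    def __init__(self, stream, block_size):
        self._stream = stream
        self._block_size = block_size

    def __iter__(self):
        return self

    def __next__(self):
        block = self._stream.read(self._block_size)
        if not block:  # b'' means end of file
            raise StopIteration
        return block

    def __getattr__(self, name):
        # Delegate anything else (seek, tell, close, ...) to the wrapped stream
        return getattr(self._stream, name)

wrapped = BlockFile(io.BytesIO(bytes(range(40))), 16)
sizes = [len(block) for block in wrapped]
print(sizes)           # [16, 16, 8]
print(wrapped.tell())  # 40
```

The same wrapper should work around a file object returned by open(path, 'rb'), since it relies only on the read method.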


python

io

file-io

raw-file

raw
