How to minimize space cost when using itertools.tee to check the next element?

I'm trying to use itertools.tee to find out whether an iterator is empty, without fully consuming it:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
        # iterator is empty - raise some error here
        pass
    return i2  # return the non-empty iterator to the caller

As the docs for tee state:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Clearly, when i is not empty, i2 uses most of the data before i1 does. Would a simple del overcome the problem?:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
        # iterator is empty - raise some error here
        pass
    del i1  # Does this overcome the storage issue?
    return i2  # return the non-empty iterator to the caller

Is there a better way to achieve this?

Thanks in advance!

I mean, in your specific case, what's wrong with this:

from itertools import chain
def get_iterator(i):
    try:
        first = next(i)
    except StopIteration:
        # iterator is empty - re-raise, or raise your own error here
        raise
    return chain([first], i)

It does exactly the same thing, but stores nothing beyond the first value.
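A runnable version of this approach might look as follows (the name ensure_nonempty and the choice of ValueError are illustrative, not from the answer above):

```python
from itertools import chain

def ensure_nonempty(it):
    """Return an iterator equivalent to ``it``, raising ValueError if it
    is empty. Only the first element is buffered; everything else stays
    in the source iterator, so the space overhead is a single item."""
    it = iter(it)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("iterator is empty") from None
    return chain([first], it)

print(list(ensure_nonempty(iter(range(3)))))  # [0, 1, 2]
try:
    ensure_nonempty(iter([]))
except ValueError as exc:
    print(exc)  # iterator is empty
```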

This is a bit subtle: it relies on undocumented properties of tee and on the garbage collector. The sample Python code in the docs would store all the items from the point where the iterators were created until they're consumed by each iterator, but one might easily imagine that the real iterators have a cleanup effect that drops their claim on data in the queue. Even so, del only removes your name for the object; it doesn't guarantee the object's destruction. Such a cleanup would therefore work, but not necessarily at the time you expect. Knowing whether this happens requires reading the source code for tee. It does have the deliberately vague property of supporting weak references to the individual iterators, which suggests one way this optimization could be implemented.

The CPython code for tee_next is reasonably simple: each tee iterator holds a reference to a teedataobject, which is a batch of up to 57 items and also a node in a singly linked list. The normal reference-counting semantics therefore apply at that batch level. So basically, in CPython, up to 56 items can be kept in memory even after they've been consumed by all the iterators, but no more than that, because the reference-count handling is immediate. As long as the tee iterators exist, any number of items between them can be held; but they do not read ahead of the source iterator - an item is only fetched once at least one tee iterator has requested it, via the CPython code in teedataobject_getitem.
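The no-read-ahead behaviour is easy to observe with a source that counts how many items it has actually produced (counting_source is a made-up helper for this sketch, not part of any library):

```python
from itertools import tee

consumed = 0

def counting_source():
    # Counts how many items the underlying iterator has really yielded.
    global consumed
    for x in range(100):
        consumed += 1
        yield x

i1, i2 = tee(counting_source())
assert consumed == 0  # creating the tee reads nothing from the source
next(i1)              # the "peek"
assert consumed == 1  # exactly one item was fetched
next(i2)              # replays the buffered item; no new fetch happens
assert consumed == 1
```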

So the basic verdict is: yes, del will work in CPython, but using tee means you temporarily store batches of up to 57 items instead of just 1. Repeated application of this method could create any number of such windows - except that tee iterators are copyable and will share their underlying list.
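One way to check that del really does release the buffered items is to track them with weak references. This sketch relies on CPython's immediate reference counting (Item is a throwaway class for the demonstration):

```python
import gc
import weakref
from itertools import tee

class Item:
    pass

items = [Item() for _ in range(3)]
refs = [weakref.ref(it) for it in items]

i1, i2 = tee(iter(items))
next(i1)              # peek: pulls the first item into tee's shared buffer
del i1                # drop our name for the lagging iterator

consumed = list(i2)   # the caller drains the surviving iterator
del items, consumed, i2
gc.collect()          # belt and braces; refcounting usually suffices
# With both tee iterators gone, nothing references the buffered items.
assert all(r() is None for r in refs)
```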

This describes one specific version (4243df51fe43) of CPython; the implementation will differ in, for example, PyPy, IronPython, Jython, or other versions of CPython.

For instance, PyPy's tee (revision cadf868) uses a similar linked list, but with one item per link, so it doesn't batch the way this CPython version does.

There is one notable shortcut that prevents this cost from compounding: every tee implementation I examined, when applied to an already-copyable iterator (such as another tee iterator), produces copies of it rather than wrapping it again. Repeatedly applying tee therefore does not create new layers of iterators, which is a potential problem with the chain approach.
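The copy behaviour can be sketched like this (assuming CPython; copy.copy of a tee iterator resumes from the same position and shares the same buffer):

```python
import copy
from itertools import tee

src = iter(range(10))
a, b = tee(src)
next(a)
next(a)               # a now points at element 2

# A copy of a tee iterator resumes from the same position and shares
# the underlying buffer with the original.
c = copy.copy(a)
assert next(c) == 2
assert next(a) == 2   # a and c advance independently

# Applying tee to a tee iterator yields more iterators of the same
# _tee type, not a new wrapper layered on top of it.
d, e = tee(a)
assert type(d) is type(a)
```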