Implementation specific behavior of `groupby` and argument unpacking

Question

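I'm trying to understand a quirk that I discovered in an answer I wrote earlier today. Basically, I was yielding groups from a generator function that wraps itertools.groupby. The interesting thing I found is that if I unpack the generator on the left-hand side of an assignment, the last element of the generator remains. For example: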

# test_gb.py
from itertools import groupby
from operator import itemgetter

inputs = ((x > 5, x) for x in range(10))

def make_groups(inputs):
    for _, group in groupby(inputs, key=itemgetter(0)):
        yield group

a, b = make_groups(inputs)
print(list(a))
print(list(b))

On CPython, this results in:

$ python3 ~/sandbox/test_gb.py 
[]
[(True, 9)]

This is the case on both CPython 2.7 and CPython 3.5.

On PyPy, it results in:

$ pypy ~/sandbox/test_gb.py 
[]
[]

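In both cases, the first empty list ("a") is easy to explain: a group from itertools.groupby is consumed as soon as the next element is needed. Since we didn't save those values anywhere, they are lost.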
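The consumption described above can be reproduced without any unpacking. The following is a minimal sketch that uses a made-up list of numbers rather than the inputs from the question: once the groupby object has been advanced to the second group, the first group yields nothing.

```python
from itertools import groupby

data = [1, 1, 2, 2]  # illustrative input, not taken from the question

gb = groupby(data)
key1, group1 = next(gb)  # first group (key 1), not iterated yet
key2, group2 = next(gb)  # advancing groupby skips past the rest of group 1

first = list(group1)     # the skipped items were never stored anywhere
second = list(group2)    # the current group is still intact

print(first)   # []
print(second)  # [2, 2]
```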

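In my opinion, the PyPy version also makes sense for the second empty list ("b")... When unpacking, we also consume b (because Python needs to look for what comes after, to make sure it shouldn't raise a ValueError for the wrong number of items to unpack). But for some reason, the CPython version retains the last element from the input iterable... Can anyone explain why this is the case?

EDIT

This is probably more or less obvious, but we can also write it as: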
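The extra look-ahead during unpacking can be made visible with a small counting wrapper. This is only a sketch: `CountingIterator` is a hypothetical helper written for this illustration, and the count shown is what I would expect on CPython.

```python
class CountingIterator:
    """Hypothetical helper: wraps an iterable and counts calls to __next__."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.calls = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.calls += 1
        return next(self._it)


counter = CountingIterator(["x", "y"])
a, b = counter  # two targets on the left-hand side

print(a, b)
print(counter.calls)  # 3: two items plus one call to confirm exhaustion
```

With groupby as the source, that third call is what pushes the underlying iterator all the way to its end.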

inputs = ((x > 5, x) for x in range(10))
(_, a), (_, b) = groupby(inputs, key=itemgetter(0))
print(list(a))
print(list(b))

and get the same results...

Answer 1

This is because the groupby object handles the bookkeeping, and the grouper objects just reference their key and the parent groupby object:

typedef struct {
    PyObject_HEAD
    PyObject *it;          /* iterator over the input sequence */
    PyObject *keyfunc;     /* the second argument for the groupby function */
    PyObject *tgtkey;      /* the key for the current "grouper" */
    PyObject *currkey;     /* the key for the current "item" of the iterator*/
    PyObject *currvalue;   /* the plain value of the current "item" */
} groupbyobject;

typedef struct {
    PyObject_HEAD
    PyObject *parent;      /* the groupby object */
    PyObject *tgtkey;      /* the key value for this grouper object. */
} _grouperobject;

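Since the grouper objects aren't iterated while the groupby object is being unpacked, I'll ignore them for now. So the interesting part is what happens when you call next on the groupby: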

static PyObject *
groupby_next(groupbyobject *gbo)
{
    PyObject *newvalue, *newkey, *r, *grouper;

    /* skip to next iteration group */
    for (;;) {
        if (gbo->currkey == NULL)
            /* pass */;
        else if (gbo->tgtkey == NULL)
            break;
        else {
            int rcmp;

            rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
            if (rcmp == 0)
                break;
        }

        newvalue = PyIter_Next(gbo->it);
        if (newvalue == NULL)
            return NULL;   /* just return NULL, no invalidation of attributes */
        newkey = PyObject_CallFunctionObjArgs(gbo->keyfunc, newvalue, NULL);

        gbo->currkey = newkey;
        gbo->currvalue = newvalue;
    }
    gbo->tgtkey = gbo->currkey;

    grouper = _grouper_create(gbo, gbo->tgtkey);
    r = PyTuple_Pack(2, gbo->currkey, grouper);
    return r;
}

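I removed all the irrelevant exception-handling code and removed or simplified the pure reference-counting parts. The interesting thing here is that when you reach the end of the iterator, gbo->currkey, gbo->currvalue and gbo->tgtkey aren't set to NULL. They still point to the last values encountered (the last item of the iterator), because the function simply does return NULL when PyIter_Next(gbo->it) == NULL.

After that finishes, you have your two grouper objects. The first one has a tgtkey of False and the second one a tgtkey of True. Let's see what happens when you call next on these groupers: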

static PyObject *
_grouper_next(_grouperobject *igo)
{
    groupbyobject *gbo = (groupbyobject *)igo->parent;
    PyObject *newvalue, *newkey, *r;
    int rcmp;

    if (gbo->currvalue == NULL) {
        /* removed because irrelevant. */
    }

    rcmp = PyObject_RichCompareBool(igo->tgtkey, gbo->currkey, Py_EQ);
    if (rcmp <= 0)
        /* got any error or current group is end */
        return NULL;

    r = gbo->currvalue;  /* this accesses the last value of the groupby object */
    gbo->currvalue = NULL;
    gbo->currkey = NULL;

    return r;
}

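So remember that currvalue is not NULL, so the first if branch isn't interesting. For your first grouper, it compares the tgtkey of the grouper with the currkey of the groupby object, sees that they differ (False versus True), and immediately returns NULL. So you get an empty list.

For the second grouper the keys match, so it returns the currvalue of the groupby object (which is the last value encountered in the iterator!), but this time it also sets the currvalue and currkey of the groupby object to NULL.

Switching back to Python: the really interesting quirks happen if your grouper has the same tgtkey as the last group in the groupby: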
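To follow the two C functions step by step, here is a rough pure-Python model of the simplified code above. It is a sketch, not CPython's implementation: the class names `LegacyGroupBy` and `LegacyGrouper` and the `NULL` sentinel are made up for this illustration, and the body of the `currvalue is NULL` branch (which was cut from the C listing) is my assumption that it fetches the next item from the shared iterator.

```python
from operator import itemgetter

NULL = object()  # sentinel standing in for a C NULL pointer


class LegacyGroupBy:
    """Models the fields and control flow of the simplified groupby_next."""

    def __init__(self, iterable, key):
        self.it = iter(iterable)
        self.keyfunc = key
        self.tgtkey = NULL
        self.currkey = NULL
        self.currvalue = NULL

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            if self.currkey is NULL:
                pass
            elif self.tgtkey is NULL:
                break
            elif not (self.tgtkey == self.currkey):
                break
            # On exhaustion StopIteration propagates and the three
            # attributes keep pointing at the last item that was seen.
            newvalue = next(self.it)
            self.currkey = self.keyfunc(newvalue)
            self.currvalue = newvalue
        self.tgtkey = self.currkey
        return self.currkey, LegacyGrouper(self, self.tgtkey)


class LegacyGrouper:
    """Models the simplified _grouper_next."""

    def __init__(self, parent, tgtkey):
        self.parent = parent
        self.tgtkey = tgtkey

    def __iter__(self):
        return self

    def __next__(self):
        gbo = self.parent
        if gbo.currvalue is NULL:
            # Assumed reconstruction of the branch removed from the listing.
            newvalue = next(gbo.it)
            gbo.currkey = gbo.keyfunc(newvalue)
            gbo.currvalue = newvalue
        if not (self.tgtkey == gbo.currkey):
            raise StopIteration  # the current item belongs to another group
        r = gbo.currvalue
        gbo.currvalue = NULL
        gbo.currkey = NULL
        return r


inputs = ((x > 5, x) for x in range(10))
(_, a), (_, b) = LegacyGroupBy(inputs, key=itemgetter(0))
result_a = list(a)
result_b = list(b)
print(result_a)  # []
print(result_b)  # [(True, 9)]
```

Because the model is plain Python, it produces the output from the question regardless of which interpreter runs it.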

>>> import itertools
>>> inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]
>>> (_, g1), (_, g2), (_, g3) = itertools.groupby(inputs, key=lambda x: x[0])
>>> list(g1)
[(False, 10)]
>>> list(g3)
[]

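That one element in g1 doesn't belong to the first group at all. But because the tgtkey of the first grouper object is False and the last tgtkey is also False, the first grouper thinks the element belongs to the first group. Consuming it also invalidates the groupby object, so the third group is now empty.

All code was taken from the Python source code but shortened.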
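One way to sidestep stale groupers altogether is to copy each group into a list before the groupby object is advanced, which is what the itertools documentation suggests when the data is needed later. A minimal sketch based on the make_groups function from the question:

```python
from itertools import groupby
from operator import itemgetter


def make_groups(inputs):
    for _, group in groupby(inputs, key=itemgetter(0)):
        # Materialize the group while it is still the current one.
        yield list(group)


inputs = ((x > 5, x) for x in range(10))
a, b = make_groups(inputs)

print(a)  # [(False, 0), (False, 1), (False, 2), (False, 3), (False, 4), (False, 5)]
print(b)  # [(True, 6), (True, 7), (True, 8), (True, 9)]
```

The cost is that every group is held in memory at once, so this trades the laziness of groupby for predictable results.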

python

pypy

python-internals