Python collections.Counter: most_common 复杂度

Question

Python中的collections.Counter对象提供的函数most_common的复杂度是多少？

更具体地说，是 Counter 在计数时保持某种排序列表，允许它执行 most_common 比 O(n) 更快的操作，当 n 是添加到柜台的（唯一）项目数量？供您参考，我正在处理一些大量的文本数据，试图找到第 n 个最常见的标记。

我查看了 CPython wiki 上的 official documentation and the TimeComplexity article，但找不到答案。

Answer 1

The source 准确显示发生了什么：

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abracadabra').most_common(3)
    [('a', 5), ('r', 2), ('b', 2)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))

heapq.nlargest 定义在 heapq.py

Answer 2

从 collections.py 的源代码中我们看到，如果我们不指定 returned 元素的数量，most_common returns 的排序列表计数。这是一个 O(n log n) 算法。

如果我们使用 most_common 到 return k > 1 个元素，那么我们使用 heapq.nlargest。这是一个 O(k) + O((n - k) log k) + O(k log k) 算法，非常适合小常数 k，因为它本质上是线性的。 O(k) 部分来自堆化初始 k 计数，第二部分来自 n - k 调用 heappushpop 方法，第三部分来自对 [=16= 的最终堆进行排序] 元素。由于 k <= n 我们可以得出复杂度为：

O(n log k)

如果k = 1那么很容易证明复杂度是：

O(n)

Python collections.Counter: most_common 复杂度

Python collections.Counter: most_common complexity

python

counter

time-complexity

python-collections