Why is the time taken by collections.Counter to construct a frequency map less than a simple dict creation by a for loop?
As I understand it, count() is slower because it traverses the list again for each element. But since both the Counter and the dict construction iterate over the list only once, shouldn't the dict-construction and Counter timings be similar?
I used the code below to obtain the timing values.
import timeit

if __name__ == "__main__":
    code_block = """seen = {}
for i in l:
    if seen.get(i):
        seen[i] += 1
    else:
        seen[i] = 1
"""
    setup = """import random
import string
from collections import Counter
n = 1000
l = [random.choice(string.ascii_letters) for x in range(n)]
"""
    t1 = timeit.Timer(
        stmt="Counter(l)",
        setup=setup,
    )
    t2 = timeit.Timer(
        stmt="[[x, l.count(x)] for x in set(l)]",
        setup=setup,
    )
    t3 = timeit.Timer(
        stmt=code_block,
        setup=setup,
    )
    print("Counter(): ", t1.repeat(repeat=3, number=10000))
    print("count(): ", t2.repeat(repeat=3, number=10000))
    print("seen{}: ", t3.repeat(repeat=3, number=10000))
Output:
Run 1:
Counter(): [0.32974308, 0.319977907, 0.301750341]
count(): [6.424047524000001, 6.417152854, 6.450776530000001]
seen{}: [1.1089669810000018, 1.099655232, 1.116015376]
Run 2:
Counter(): [0.322483783, 0.32464020800000004, 0.33498838900000005]
count(): [6.3235339029999995, 6.48233445, 6.543396192000001]
seen{}: [1.1192663550000006, 1.1072084830000009, 1.1155270229999985]
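As a side note (not part of the original post), collections.defaultdict(int) is another single-pass way to build the same frequency map without the get()/if-else dance; a minimal sketch:

```python
import random
import string
from collections import Counter, defaultdict

l = [random.choice(string.ascii_letters) for _ in range(1000)]

# Single pass: missing keys default to 0, so no lookup/branch is needed.
seen = defaultdict(int)
for ch in l:
    seen[ch] += 1

# The result matches Counter(l) exactly.
assert dict(seen) == dict(Counter(l))
```

This stays a pure-Python loop, so it illustrates the same question: the per-element work is comparable to the dict version, yet Counter still wins, which points at something other than algorithmic complexity.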
I corrected your code so that all three parts do comparable work: they import the same packages (to balance the overhead) and they scan the entire list of characters rather than reducing it to the unique strings -- that was your main mistake in the 'seen' code: you counted only one of each character.
import timeit

if __name__ == '__main__':
    code_block = '''seen = {}
for i in char:
    if seen.get(i):
        seen[i] += 1
    else:
        seen[i] = 1
'''
    common = 'import random;import string;from collections import Counter;n=1000;' + \
             'char=[random.choice(string.ascii_letters) for x in range(n)]'
    t1 = timeit.Timer(stmt='Counter(char)',
                      setup=common)
    t2 = timeit.Timer(stmt='[[x,char.count(x)] for x in set(char)]',
                      setup=common)
    t3 = timeit.Timer(stmt=code_block,
                      setup=common)
    print("Counter(): ", t1.repeat(repeat=3, number=10000))
    print("count(): ", t2.repeat(repeat=3, number=10000))
    print("seen{}: ", t3.repeat(repeat=3, number=10000))
Output:
Counter(): [0.48620019999999997, 0.49807440000000014, 0.3896322000000001]
count(): [9.7432961, 9.620701299999999, 9.674791500000001]
seen{}: [1.4734368999999994, 1.462895500000002, 1.4655799000000016]
It appears that Counter is indeed the fastest, as expected.
TL;DR
Counter.__init__ uses a C loop (at least in CPython) to count the elements of the iterable; see https://github.com/python/cpython/blob/6b1ac809b9718a369aea67b99077cdd682be2238/Modules/_collectionsmodule.c#L2277.
Counter is (mostly, see below) implemented in Python, which means its code can easily be inspected, debugged, and even modified.
Counter.__init__ in CPython 3.8 (docstring excluded):
def __init__(self, iterable=None, /, **kwds):
    super(Counter, self).__init__()
    self.update(iterable, **kwds)
and Counter.update (irrelevant paths excluded):
def update(self, iterable=None, /, **kwds):
    if iterable is not None:
        if isinstance(iterable, _collections_abc.Mapping):
            ...
        else:
            _count_elements(self, iterable)
    if kwds:
        ...
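The elided Mapping branch is what lets update() merge counts from another dict-like object; a quick illustration of the behavior (not the stdlib source):

```python
from collections import Counter

c = Counter("aab")            # Counter({'a': 2, 'b': 1})
c.update({'a': 3, 'c': 1})    # Mapping branch: counts are added, not replaced
assert c == Counter({'a': 5, 'b': 1, 'c': 1})
```

For the benchmark in the question only the non-Mapping branch matters, which leads straight to _count_elements.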
and _count_elements:
def _count_elements(mapping, iterable):
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1
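Even aside from the C fallback discussed next, the idiom itself is worth noting: `mapping_get(elem, 0) + 1` replaces the explicit if/else from the question's loop with a single lookup, and caching the bound method avoids repeated attribute resolution. The same shape in plain code (a sketch):

```python
from collections import Counter

l = list("abracadabra")

seen = {}
get = seen.get                  # cache the bound method, as _count_elements does
for elem in l:
    seen[elem] = get(elem, 0) + 1

assert seen == dict(Counter(l))  # identical frequency map
```

This pure-Python version is typically somewhat faster than the if/else variant, though still slower than the C helper.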
However, right below the implementation of _count_elements there is a very important piece of code and comment:
try:  # Load C helper function if available
    from _collections import _count_elements
except ImportError:
    pass
In other words, Counter counts its elements with a C loop, which is inherently faster than the Python loop you are using.
We can run a small experiment. Comment out the code that imports the C function:
# try:  # Load C helper function if available
#     from _collections import _count_elements
# except ImportError:
#     pass
Running your code now gives:
Counter(): [1.8369901, 1.8359803000000001, 1.940804]
seen{}: [1.2392313000000001, 1.2483893999999998, 1.3157528000000003]
Now Counter is actually slower than a plain dict.