排序 python 3.7+ 字典的最快方法
Fastest way to sort a python 3.7+ dictionary
既然 insertion order of Python dictionaries is guaranteed starting in Python 3.7 (and ),best/fastest 对字典进行排序的方法是什么 - 按值和键排序?
最明显的方法可能是这样的:
by_key = {k: dct[k] for k in sorted(dct.keys())}
by_value = {k: dct[k] for k in sorted(dct.keys(), key=dct.__getitem__)}
是否有其他更快的方法来做到这一点?
请注意,这个问题不是重复的,因为之前关于如何对字典进行排序的问题已经过时(答案基本上是,你不能;使用 collections.OrderedDict
而不是 ).
TL;DR:在 CPython 3.7 中按键或按值(分别)排序的最佳方法:
{k: d[k] for k in sorted(d)}
{k: v for k,v in sorted(d.items(), key=itemgetter(1))}
在 macbook 上测试 sys.version
:
3.7.0b4 (v3.7.0b4:eb96c37699, May 2 2018, 04:13:13)
[Clang 6.0 (clang-600.0.57)]
使用包含 1000 个浮点数的字典的一次性设置:
>>> import random
>>> from operator import itemgetter
>>> random.seed(123)
>>> d = {random.random(): random.random() for i in range(1000)}
按键排序数字(从最好到最差):
>>> %timeit {k: d[k] for k in sorted(d)}
# 296 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 306 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 345 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 359 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 391 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items()))
# 409 µs ± 9.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 420 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 432 µs ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
按值对数字进行排序(从最好到最差):
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 355 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 375 µs ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 393 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 402 µs ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 404 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 404 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 480 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
使用大量字符串的一次性设置:
>>> import random
>>> from pathlib import Path
>>> from operator import itemgetter
>>> random.seed(456)
>>> words = Path('/usr/share/dict/words').read_text().splitlines()
>>> random.shuffle(words)
>>> keys = words.copy()
>>> random.shuffle(words)
>>> values = words.copy()
>>> d = dict(zip(keys, values))
>>> list(d.items())[:5]
[('ragman', 'polemoscope'),
('fenite', 'anaesthetically'),
('pycnidiophore', 'Colubridae'),
('propagate', 'premiss'),
('postponable', 'Eriglossa')]
>>> len(d)
235886
按键对字符串字典进行排序:
>>> %timeit {k: d[k] for k in sorted(d)}
# 387 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 387 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 461 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 466 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 488 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 536 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items()))
# 661 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 687 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
按值对字符串字典进行排序:
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 468 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 473 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 492 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 496 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 533 ms ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 544 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 566 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
注意:现实世界的数据通常包含长 运行s 的已排序序列,Timsort 算法可以利用这些序列。如果对 dict 进行排序在您的快速路径上,那么建议在得出有关最佳方法的任何结论之前,使用您自己的典型数据在您自己的平台上进行基准测试。我在每个时间结果前添加了一个注释字符 (#
),这样 IPython 用户可以 copy/paste 整个代码块来重新 运行 他们自己平台上的所有测试.
既然 insertion order of Python dictionaries is guaranteed starting in Python 3.7 (and
最明显的方法可能是这样的:
by_key = {k: dct[k] for k in sorted(dct.keys())}
by_value = {k: dct[k] for k in sorted(dct.keys(), key=dct.__getitem__)}
是否有其他更快的方法来做到这一点?
请注意,这个问题不是重复的,因为之前关于如何对字典进行排序的问题已经过时(答案基本上是,你不能;使用 collections.OrderedDict
而不是 ).
TL;DR:在 CPython 3.7 中按键或按值(分别)排序的最佳方法:
{k: d[k] for k in sorted(d)}
{k: v for k,v in sorted(d.items(), key=itemgetter(1))}
在 macbook 上测试 sys.version
:
3.7.0b4 (v3.7.0b4:eb96c37699, May 2 2018, 04:13:13)
[Clang 6.0 (clang-600.0.57)]
使用包含 1000 个浮点数的字典的一次性设置:
>>> import random
>>> from operator import itemgetter
>>> random.seed(123)
>>> d = {random.random(): random.random() for i in range(1000)}
按键排序数字(从最好到最差):
>>> %timeit {k: d[k] for k in sorted(d)}
# 296 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 306 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 345 µs ± 4.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 359 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 391 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items()))
# 409 µs ± 9.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 420 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 432 µs ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
按值对数字进行排序(从最好到最差):
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 355 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 375 µs ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 393 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 402 µs ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 404 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 404 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 480 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
使用大量字符串的一次性设置:
>>> import random
>>> from pathlib import Path
>>> from operator import itemgetter
>>> random.seed(456)
>>> words = Path('/usr/share/dict/words').read_text().splitlines()
>>> random.shuffle(words)
>>> keys = words.copy()
>>> random.shuffle(words)
>>> values = words.copy()
>>> d = dict(zip(keys, values))
>>> list(d.items())[:5]
[('ragman', 'polemoscope'),
('fenite', 'anaesthetically'),
('pycnidiophore', 'Colubridae'),
('propagate', 'premiss'),
('postponable', 'Eriglossa')]
>>> len(d)
235886
按键对字符串字典进行排序:
>>> %timeit {k: d[k] for k in sorted(d)}
# 387 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d.keys())}
# 387 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(0)))
# 461 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[0]))
# 466 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(0))}
# 488 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[0])}
# 536 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items()))
# 661 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items())}
# 687 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
按值对字符串字典进行排序:
>>> %timeit {k: v for k,v in sorted(d.items(), key=itemgetter(1))}
# 468 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=itemgetter(1)))
# 473 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dict(sorted(d.items(), key=lambda kv: kv[1]))
# 492 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: v for k,v in sorted(d.items(), key=lambda kv: kv[1])}
# 496 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.__getitem__)}
# 533 ms ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=d.get)}
# 544 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit {k: d[k] for k in sorted(d, key=lambda k: d[k])}
# 566 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
注意:现实世界的数据通常包含长 运行s 的已排序序列,Timsort 算法可以利用这些序列。如果对 dict 进行排序在您的快速路径上,那么建议在得出有关最佳方法的任何结论之前,使用您自己的典型数据在您自己的平台上进行基准测试。我在每个时间结果前添加了一个注释字符 (#
),这样 IPython 用户可以 copy/paste 整个代码块来重新 运行 他们自己平台上的所有测试.