size of run in Timsort

I'm studying Timsort (link: hackernoon's Timsort article).

It says: "Merging 2 arrays is more efficient when the number of runs is equal to, or slightly less than, a power of two. Timsort chooses minrun to try to ensure this efficiency, by making sure minrun is equal to or less than a power of two."

Why is it that "Merging 2 arrays is more efficient when the number of runs is equal to, or slightly less than, a power of two"?

It's about balanced merging in the general/random case. (If the given data isn't random but has long natural runs, then minrun, and therefore its choice, may not matter much, and the balance depends more on the lengths of the runs than on the number of runs.)

In the general/random case you expect to produce runs of exactly minrun elements. Then you get perfectly balanced merges when the number of runs is a power of two. If the number of runs is slightly more than a power of two, you end up with one very unbalanced merge. If it is slightly less than a power of two, you only get a little imbalance, as the sketch below illustrates.
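Here is a rough Python sketch of that effect (merge_cost is my own helper, not CPython code): it pushes run lengths onto a stack and collapses it with a simplified version of the merge rules described in listsort.txt, charging a + b for every merge of two runs of lengths a and b.

    # Rough cost model: collapse a run stack with simplified Timsort-style
    # rules, charging a + b for each merge of runs of lengths a and b.
    def merge_cost(run_lengths):
        stack, cost = [], 0

        def merge_at(i):
            nonlocal cost
            merged = stack[i] + stack[i + 1]
            cost += merged
            stack[i:i + 2] = [merged]

        for run in run_lengths:
            stack.append(run)
            while True:                       # restore the stack invariants
                n = len(stack)
                if n > 2 and stack[-3] <= stack[-2] + stack[-1]:
                    merge_at(n - 3 if stack[-3] < stack[-1] else n - 2)
                elif n > 1 and stack[-2] <= stack[-1]:
                    merge_at(n - 2)
                else:
                    break
        while len(stack) > 1:                 # final forced collapse
            merge_at(len(stack) - 2)
        return cost

    for runs in ([33] * 64, [32] * 66, [32] * 63):
        print(len(runs), "runs:", merge_cost(runs) / sum(runs), "cost per element")
    # 64 runs -> 6.0   (perfectly balanced merges)
    # 66 runs -> ~6.85 (a lopsided 2048 + 64 merge at the very end)
    # 63 runs -> ~5.98 (only a little imbalance in the final collapse)

Under this model, 64 equal runs give exactly six passes over every element, 63 runs cost about the same per element, while 66 runs end with a lopsided 2048 + 64 merge that touches all the elements again just to place 64 of them.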

Again, this is (at least mostly) about the general/random case. For example, if you have nine natural runs of lengths 800, 100, 100, 100, 100, 100, 100, 100, 100 elements, then you also get perfectly balanced merges all the way through, even though the number of runs is slightly more than a power of two.
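To make that nine-run example concrete, here is a tiny trace under the same simplified a + b cost per merge (the run stack from the sketch above ends up performing the same merges for this input, just in a different order):

    # Eight natural runs of 100 merge pairwise into 200s, 400s and one 800,
    # which then merges with the leading 800-element natural run, so every
    # single merge combines two runs of equal length.
    runs = [100] * 8
    cost = 0
    while len(runs) > 1:                  # 100s -> 200s -> 400s -> 800
        runs = [runs[i] + runs[i + 1] for i in range(0, len(runs), 2)]
        cost += sum(runs)
    cost += 800 + runs[0]                 # final merge: 800 + 800 = 1600
    print(cost)                           # 4000 moves for 1600 elements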

From Tim's description of the minrun choice in listsort.txt:

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!
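For reference, the rule this leads to (keep the six most significant bits of N and add 1 if any of the remaining bits are set) can be sketched in Python as follows; this follows the description in listsort.txt rather than quoting the CPython source:

    def merge_compute_minrun(n):
        """For n >= 64, return a minrun in [32, 64] chosen so that n / minrun
        is a power of two, or slightly less than one; smaller n is returned
        unchanged (such an array is handled as a single run)."""
        r = 0                       # becomes 1 if any bit gets shifted off
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

    print(merge_compute_minrun(2112))   # 33, and divmod(2112, 33) == (64, 0)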