使用堆进行大磁盘排序
Using heap for big disk sorts
在官方 Python 文档 here 中提到:
Heaps are also very useful in big disk sorts. You most probably all
know that a big sort implies producing “runs” (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by a merging passes for these runs, which merging is often
very cleverly organised.
It is very important that the initial
sort produces the longest runs possible. Tournaments are a good way
to achieve that. If, using all the memory available to hold a
tournament, you replace and percolate items that happen to fit the
current run, you’ll produce runs which are twice the size of the
memory for random input, and much better for input fuzzily ordered.
Moreover, if you output the 0’th item on disk and get an input which
may not fit in the current tournament (because the value “wins” over
the last output value), it cannot fit in the heap, so the size of the
heap decreases. The freed memory could be cleverly reused
immediately for progressively building a second heap, which grows at
exactly the same rate the first heap is melting.
When the first heap
completely vanishes, you switch heaps and start a new run. Clever and
quite effective!
我知道一种叫做 External sorting 的算法,我们:
- 将输入分解为更小的块。
- 将所有块单独排序,然后将它们一个接一个地写回磁盘。
- 创建一个堆并在所有排序的块之间进行 k 路合并。
我完全理解维基百科上描述的外部排序,但我无法理解作者所说的:
If, using all the memory available to hold a tournament, you replace
and percolate items that happen to fit the current run, you’ll produce
runs which are twice the size of the memory for random input, and much
better for input fuzzily ordered.
和:
Moreover, if you output the 0’th item on disk and get an input which
may not fit in the current tournament (because the value “wins” over
the last output value), it cannot fit in the heap, so the size of the
heap decreases.
这是什么堆熔?
堆熔化不是一回事。就是作者用heap getting smaller 来拉出最小项的词
他所说的想法是 "divide the input into chunks and sort the chunks" 部分外部排序的巧妙替代。它产生更大的排序块。
这个想法是,首先将最大的块读入内存并将其排列到堆中,然后在读取新元素时开始从堆中写出最小的元素。
当你读入一个小于你已经写出的元素的元素时,它不能进入当前块(它会破坏排序),所以你会记住下一个块。可以将不小于您最后写出的元素的元素插入堆中。他们会把它放到当前区块中,使当前区块变大。
最终你的堆将是空的。到那时你就完成了当前块——堆化你记住的所有元素并开始写出下一个块。
在官方 Python 文档 here 中提到:
Heaps are also very useful in big disk sorts. You most probably all know that a big sort implies producing “runs” (which are pre-sorted sequences, whose size is usually related to the amount of CPU memory), followed by a merging passes for these runs, which merging is often very cleverly organised.
It is very important that the initial sort produces the longest runs possible. Tournaments are a good way to achieve that. If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases. The freed memory could be cleverly reused immediately for progressively building a second heap, which grows at exactly the same rate the first heap is melting.
When the first heap completely vanishes, you switch heaps and start a new run. Clever and quite effective!
我知道一种叫做 External sorting 的算法,我们:
- 将输入分解为更小的块。
- 将所有块单独排序,然后将它们一个接一个地写回磁盘。
- 创建一个堆并在所有排序的块之间进行 k 路合并。
我完全理解维基百科上描述的外部排序,但我无法理解作者所说的:
If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.
和:
Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases.
这是什么堆熔?
堆熔化不是一回事。就是作者用heap getting smaller 来拉出最小项的词
他所说的想法是 "divide the input into chunks and sort the chunks" 部分外部排序的巧妙替代。它产生更大的排序块。
这个想法是,首先将最大的块读入内存并将其排列到堆中,然后在读取新元素时开始从堆中写出最小的元素。
当你读入一个小于你已经写出的元素的元素时,它不能进入当前块(它会破坏排序),所以你会记住下一个块。可以将不小于您最后写出的元素的元素插入堆中。他们会把它放到当前区块中,使当前区块变大。
最终你的堆将是空的。到那时你就完成了当前块——堆化你记住的所有元素并开始写出下一个块。