
Justification of constants used in random.sample

I am looking at the source code for the function sample in random.py (the Python standard library).

The idea is simple: when the sample size k is small relative to the population size n, track previous selections in a set and reselect on a collision; when k is large relative to n, copy the population into a pool list and pick from that instead.

My question

There are a few constants involved: setsize = 21 and setsize += 4 ** _ceil(_log(k * 3, 4)). The critical ratio is therefore roughly k : 21 + 3k (ignoring the rounding up to a power of 4). The comments say "# size of a small set minus size of an empty list" and "# table size for big sets".

These comments shed some light, but I find that they raise as many questions as they answer.

Looking at the history on GitHub, it seems that a very old version simply used k : 6*k as the critical ratio, but I find that equally mysterious.
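
To make the crossover concrete, here is a small sketch (my own; the helper setsize_threshold is not part of random.py) that simply re-evaluates the setsize expression from the code below for a few values of k:

from math import ceil as _ceil, log as _log

def setsize_threshold(k):
    # Re-evaluate the expression from random.sample: the set-tracking branch
    # is only taken when the population size n exceeds this value.
    setsize = 21                              # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    return setsize

for k in (1, 5, 6, 20, 100, 1000):
    print(k, setsize_threshold(k))            # -> 21, 21, 85, 85, 1045, 4117

So for k = 100, for example, the set-tracking branch is only used once the population has more than 1045 elements.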

The code

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result

(Sorry, this would probably be better suited to math.stackexchange. I can't think of any probability/statistics-y reason for this particular ratio, and the comments sound as though it may have something to do with the amount of space that sets and lists use - but I could not find any details anywhere.)

This code is trying to determine whether using a list or a set would take more space (rather than trying to estimate the time cost, for some reason).

It looks like 21 is the difference between the size of an empty list and a small set, in multiples of the pointer size, on the Python build this constant was determined on. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:

>>> import sys
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
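
Continuing the same interpreter session, converting that difference into pointer-sized units (assuming 8-byte pointers on a 64-bit build) gives the 20 quoted above:

>>> diff = sys.getsizeof(set()) - sys.getsizeof([])
>>> diff, diff // 8
(160, 20)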

Comparing the 3.6.3 list and set struct definitions against the list and set definitions from the change that introduced this code, 21 seems plausible.


I said "the difference between the size of an empty list and a small set" because, both now and at the time, small sets use a hash table contained inside the set struct itself rather than one allocated externally:

setentry smalltable[PySet_MINSIZE];

if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets

This check adds the size of the externally allocated table used for sets with more than 5 elements, with the size again expressed as a number of pointers. The computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
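
To see what that term evaluates to in practice, here is a short sketch (mine, not from random.py) that prints the estimate for a few values of k. It is essentially the smallest power of 4 that is at least 3*k, so it can overshoot 3*k by nearly a factor of 4; whether it matches CPython's actual resizing behaviour is exactly the part I am unsure about:

from math import ceil as _ceil, log as _log

for k in (6, 20, 25, 100, 1000):
    estimate = 4 ** _ceil(_log(k * 3, 4))   # table size for big sets
    print(k, 3 * k, estimate)               # e.g. k=6 -> 64, k=25 -> 256, k=100 -> 1024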

Finally,

if n <= setsize:

compares the base overhead of a set, plus any space used by an external hash table, against the n pointers required by a list of the input elements. (It does not seem to account for the overallocation performed by list(population), so it may underestimate the cost of the list.)
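
A quick, version-dependent way to check that last point on a given build (the variable names are mine; sys.getsizeof of a list reflects its allocated capacity, not just its length):

import sys

population = range(100000)
n = len(population)

exact = sys.getsizeof([None] * n)         # list created with exactly n slots
built = sys.getsizeof(list(population))   # built by growing, so it may be overallocated
print(exact, built, built - exact)

If built comes out larger than exact, that gap is the overallocation the n <= setsize comparison ignores.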