Random number generator that returns unique entries in sorted order

Question

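I need a generator for many (up to one trillion, 10^12) unique random 64-bit numbers. The generator needs to return the numbers in sorted order (Long.MIN_VALUE to Long.MAX_VALUE). The problem is that sorting 10^12 numbers is slow. The use case is replicating a test that was run for BBHash (in the paper, section 4.5, "Indexing a trillion keys").

The straightforward solution is to create the set in memory, using a huge bit set or similar to ensure no duplicates are returned. But this uses too much memory or I/O. I'd like to use at most a few MB of internal state.

The generator should use java.util.Random internally. It should be as "fair" as possible (have the same statistical distribution as if the numbers were generated some other way). I'd also like a version for 128-bit numbers (2 longs).

What I have so far is code that creates the set in memory (Java code):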

public static void main(String... args) {
    for(long x : randomSet(10, 0)) {
        System.out.println(x);
    }
}

static Iterable<Long> randomSet(int size, int seed) {
    Random r = new Random(seed);
    TreeSet<Long> set = new TreeSet<Long>();
    while (set.size() < size) {
        set.add(r.nextLong());
    }
    return set;
}

-8292973307042192125
-7423979211207825555
-6688467811848818630
-4962768465676381896
-2228689144322150137
-1083761183081836303
-279624296851435688
4437113781045784766
6146794652083548235
7105486291024734541

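The simplest (wrong) solution, which is not random, is to distribute the results evenly. I don't think an "add a random gap" solution will work: it is slow, and after 10^12 entries the sum of such gaps will not land where it should (well, maybe: remember how many numbers are left, then recalculate the distribution...). I think the following should work, but it is complicated, and I'm not sure which formula to use: for each bit level, recursively calculate how many 0s and 1s are likely to occur (using the binomial distribution in some way, or an approximation such as the normal / Gaussian distribution). Stop at some point (say, blocks of 1 million entries or fewer) and use the code above, for speed. But maybe there is an elegant solution. Maybe this is related to the Metropolis–Hastings algorithm, I'm not sure. I read "An Efficient Algorithm for Sequential Random Sampling", but I think it only applies to small n, and I find it hard to extract a simple algorithm from it.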
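The "remember how many numbers are left, then recalculate the distribution" idea can be sketched as follows. This is only an illustration under stated assumptions, not a solution to the question: it produces sorted doubles in [0, 1) rather than unique 64-bit longs, and the names `SortedUniforms` and `sorted` are made up for this example. It relies on the fact that the minimum of k independent uniform values on [0, 1) has the CDF 1 - (1 - x)^k, and that the remaining k - 1 values are again uniform above that minimum.

```java
import java.util.Random;

final class SortedUniforms {
    /**
     * Returns n values in non-decreasing order that are distributed like
     * n sorted uniform doubles, without sorting. Hypothetical helper.
     */
    static double[] sorted(Random r, int n) {
        double[] out = new double[n];
        double cur = 0.0;
        for (int i = 0; i < n; i++) {
            int remaining = n - i;
            double u = r.nextDouble();
            // fraction of the remaining interval [cur, 1) consumed by the
            // minimum of `remaining` uniform values (inverse CDF of the minimum)
            double gap = 1.0 - Math.pow(1.0 - u, 1.0 / remaining);
            cur = cur + (1.0 - cur) * gap;
            out[i] = cur;
        }
        return out;
    }
}
```

Each entry costs one `Math.pow` call, which matches the concern that this approach is slow. A double also carries only 53 bits of precision, so scaling the result to 64-bit integers would neither cover all values nor guarantee uniqueness.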

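Java code would be best, but C is fine too (at some point I may have to convert it to C/C++ anyway). I'd like to avoid using too many libraries, to simplify porting.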

Answer 1

Given the requirements

generate a sequence of random numbers r_i from an integer interval I = [-R, R-1], R > 0, with a statistical distribution like java.util.Random

the sequence r_i must be strictly increasing (r_i > r_j for i > j)

we can come up with a simple algorithm

A1:
 - draw a random number r_i from I via a library call
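 - if it is less than or equal to the last draw, discard it and draw again

A possible objection is that this algorithm may not deliver the required number of values r_i; there is a vague requirement of N = 10^12 expected numbers in total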

"need a generator for many (up to one trillion, 10^12) unique random 64-bit numbers"

A solution would be

A2:
 - to generate N numbers and then 
 - sort them

But there is another requirement, namely that there is not enough memory available for this.

"I'd like to use at most a few MB of internal state."

My conjecture is that it is not possible to satisfy all of these requirements at once.

As a compromise I suggest

A3:
 R = 2^63 ≈ 9.22 * 10^18
 N = 1 trillion = 10^12
 - divide the range I = [-R, R-1] into N intervals of length 2R/N each
 - visit each of those intervals (visiting one interval after another)
 - draw a random number from that interval

This yields N random numbers in increasing order.
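A minimal sketch of A3 in Java, under a few assumptions that are not part of the answer: the class and method names are hypothetical, N must be at least 2, the interval width is rounded down to floor((2^64 - 1) / N), and `Math.floorMod` is used to pick the offset, which introduces a slight modulo bias.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Random;

final class StratifiedSequence {
    /** Returns n strictly increasing longs, one per equal-width interval (n >= 2). */
    static Iterator<Long> stratified(final Random r, final long n) {
        // interval width; for n >= 2 it fits in a positive long
        final long width = Long.divideUnsigned(-1L, n);
        return new Iterator<Long>() {
            private long i = 0;

            @Override
            public boolean hasNext() {
                return i < n;
            }

            @Override
            public Long next() {
                if (i >= n) {
                    throw new NoSuchElementException();
                }
                // offset inside the i-th interval, in [0, width)
                long offset = Math.floorMod(r.nextLong(), width);
                // two's complement wrap-around maps [0, 2^64) onto the signed range
                long value = Long.MIN_VALUE + i * width + offset;
                i++;
                return value;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

The only state is the interval index, so memory use is constant. The price is the one already noted in the question: every interval contains exactly one value, so the output is more evenly spread than a sorted sample of independent random numbers would be.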

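Update:

After skimming the BBHash paper and sources a few times, this is my understanding:

Given some integer set I and a subset S with N=|S| elements, the BBHash procedure computes a function f that maps S to some permutation of {1,..,N} (which permutation seems to be decided implicitly by the BBHash procedure) and maps all other elements of I to a special value Imax from I.

Possible tests:

Given S and f, one can check whether membership in S is computed correctly for some arbitrary element of I.

One can also check that f(S) = {1,..,N}.

My guess is that the requested algorithm is meant to compute a sample set S with N=10^12 on the fly under a tight memory budget, and that what is needed is uniqueness of the random number sequence rather than monotonicity.

Quoting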

Probabilistic data structures can't give you a definite answer, instead they provide you with a reasonable approximation of the answer and a way to approximate this estimation. They are extremely useful for big data and streaming application because they allow to dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).

In majority of the cases these data structures use hash functions to randomize the items. Because they ignore collisions they keep the size constant, but this is also a reason why they can't give you exact values.

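In the case of BBHash, a series of different hash functions h_i is used. One applies different h_i until no collision occurs. This only works if the input is unique, and only if the implementation has enough different h_i for the particular S.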

Answer 2

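10^12 is about 2^40, which means the average step between consecutive values is about 2^24.

So if the goal is to generate an unpredictable but ordered sequence of hashes, that is not possible: 2^24 is far too easy to brute-force.

But if that is not the goal, why not put an incrementing 40-bit counter in the high bits and a 24-bit random value in the low bits?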
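A small sketch of this layout, assuming the counter stays below 2^40 and that the result should sort as a signed long. The names `CounterPlusNoise` and `entry` are hypothetical.

```java
import java.util.Random;

final class CounterPlusNoise {
    static final int LOW_BITS = 24;

    /** Builds one entry from a counter in [0, 2^40) and 24 random low bits. */
    static long entry(long counter, Random r) {
        long low = r.nextInt(1 << LOW_BITS);          // value in [0, 2^24)
        long unsigned = (counter << LOW_BITS) | low;  // counter in the high 40 bits
        // adding Long.MIN_VALUE (with wrap-around) turns unsigned order into signed order
        return unsigned + Long.MIN_VALUE;
    }
}
```

Calling `entry` with counter values 0, 1, 2, ... gives strictly increasing, unique results, because the counter occupies the high bits. Like any scheme with one value per fixed-width slot, it produces exactly one entry per block of 2^24 values.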

Answer 3

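You want a lot of pseudo-random 64-bit numbers, all of them unique. Encryption is unique given unique inputs and the same key; it has to be, because it is reversible. DES is a 64-bit block cipher, so encrypting the numbers 0, 1, 2, 3, 4, ... 10^12 with DES in ECB mode will give you a trillion unique 64-bit numbers. With the same key they are guaranteed to be unique because the inputs are unique. A different key will give a different set of unique numbers, but some of them may duplicate numbers from the first set.

For 128-bit numbers, use AES, which has a 128-bit block size, again in ECB mode with a fixed key.

The only internal state needed is the key you are using and a number indicating how far you are through the [0..10^12] range.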
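A minimal sketch of the counter-encryption idea using the standard `javax.crypto` API. The class name `CounterCipher` and the 8-byte key passed in are placeholders for illustration, and checked security exceptions are simply wrapped in unchecked ones to keep the example short.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class CounterCipher {
    private Cipher cipher;

    /** key8 must be exactly 8 bytes long (a DES key). */
    CounterCipher(byte[] key8) {
        try {
            // ECB without padding: one 8-byte block in, one 8-byte block out
            cipher = Cipher.getInstance("DES/ECB/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key8, "DES"));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts a counter value; distinct counters give distinct results. */
    long encrypt(long counter) {
        try {
            byte[] in = ByteBuffer.allocate(8).putLong(counter).array();
            byte[] out = cipher.doFinal(in);
            return ByteBuffer.wrap(out).getLong();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `encrypt(0)`, `encrypt(1)`, ... yields unique but unsorted values, so the sorting step still has to be done separately.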

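You will need to sort the output separately. Given that it is easy to restart the process from the last stored number to generate the next batch, I suspect a merge sort would be relatively easy to implement, with each new batch being merged into the already sorted main file as you generate it. The batch size can be kept within memory capacity, with the main file held on disk.

This solution does not use java.util.Random. How important is that to you? Encryption is designed to appear random to all but the most sophisticated cryptanalysis, and is probably 'more random' than the standard Java Random PRNG.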

Answer 4

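Let's call the range of random values U. For starters, that is the range of 64-bit signed integers, so there are 2^64 possible values. Let's call the total number of sorted random values you need to generate N, which you said is about 10^12.

It is reasonable to decide up front how much memory to use. Suppose your machine can allocate and use 1 GB without trouble. That is 134,217,728 64-bit values. Call that A (the array size).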

N / A = 7450.58..., so round up to 7451 buckets and adjust A to ceil(N/7451), which is 134,210,174. Compute R (the bucket range) = U/7451.

Loop over 7451 buckets (B):
    Generate 134,210,174 random values in the range (0..R),
    inserting them into the array as they are produced. Binary
    insertion should be reasonable (N*log(N), just like generating
    them all then sorting, but you can use the insertion to catch
    duplicates so you don't need extra memory or time for that).

    Output the bucket of values, adding (B*R) to each.

You will exceed N by a few thousand (7451 * 134,210,174 = 1,000,000,006,474); if that matters, randomly select as many buckets as needed and remove one value from each.
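The bucket loop above can be sketched in Java at a much smaller scale. This is an illustration under assumptions, not the answer's exact method: a `TreeSet` stands in for the array with binary insertion, the names `BucketedSequence` and `generate` are hypothetical, the bucket count must be at least 2, R is computed as floor((2^64 - 1) / buckets), and `Math.floorMod` adds a slight modulo bias.

```java
import java.util.Random;
import java.util.TreeSet;
import java.util.function.LongConsumer;

final class BucketedSequence {
    /** Emits buckets * perBucket unique longs in increasing order. */
    static void generate(Random r, int buckets, int perBucket, LongConsumer out) {
        // R: the width of one bucket (assumes buckets >= 2)
        long range = Long.divideUnsigned(-1L, buckets);
        for (int b = 0; b < buckets; b++) {
            // the sorted set both orders the values and rejects duplicates
            TreeSet<Long> bucket = new TreeSet<Long>();
            while (bucket.size() < perBucket) {
                bucket.add(Math.floorMod(r.nextLong(), range));
            }
            // shift the bucket to its position in the signed 64-bit range
            long base = Long.MIN_VALUE + b * range;
            for (long v : bucket) {
                out.accept(base + v);
            }
        }
    }
}
```

For example, `BucketedSequence.generate(new Random(7), 8, 100, System.out::println)` prints 800 increasing values. Memory use is bounded by one bucket, but every bucket receives exactly the same number of entries, which a truly random sample would not.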

Answer 5

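I have a solution.

(It turns out that generating 100'000 or more entries in roughly sorted order is faster than generating them with one large HashSet. Roughly sorted means replacing the TreeSet with a HashSet and using a limit of 10'000 instead of 5. This is because the duplicate test is much faster.)

Number of random entries per fixed (sub-)range

Create a tree: for each bit level (starting with the most significant bit), recursively generate a random number, using the normal distribution, that says how many entries should have the bit at this level set to 0. The remaining entries have the bit at this level set to 1. At each recursion level, this shrinks the range by about half. Stop when there are, for example, fewer than 1 million entries, then switch to an in-memory pseudo-RNG and sort those numbers (or use a bit field).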
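The split at each level can be written out as follows. This is a sketch of the approximation rather than an exact derivation: it treats the top bit of each of the n entries as an independent fair coin flip, which is only approximately true for unique entries (the exact count is hypergeometric), but the difference is small while n is far smaller than the size of the range.

```latex
Z \sim \mathrm{Binomial}\left(n, \tfrac{1}{2}\right), \qquad
\mathrm{E}[Z] = \frac{n}{2}, \qquad
\mathrm{Var}(Z) = \frac{n}{4}, \qquad
\sigma = \frac{\sqrt{n}}{2}

P(Z \le k) \approx \Phi\left(\frac{k - n/2}{\sqrt{n}/2}\right)

\text{For } n = 10^{12}: \quad \mathrm{E}[Z] = 5 \cdot 10^{11}, \qquad \sigma = 5 \cdot 10^{5}
```

Here Z is the number of entries whose bit at the current level is 0, and Φ is the standard normal CDF. At the top level the count of zeros therefore typically deviates from exactly half by something on the order of a few hundred thousand entries.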

Here is some code (not tested yet):

public static void main(String... args) {
    Random r = new Random();
    Iterator<Long> it = randomSequence(r, 10, 32);
    while(it.hasNext()) {
        System.out.println(it.next());
    }
}

/**
 * Random sequence generator.
 *
 * @param r the random generator
 * @param size the number of entries to generate
 * @param shift the index of the highest bit of the result (values have shift + 1 bits)
 * @return the iterator
 */
static Iterator<Long> randomSequence(final Random r, final long size, final int shift) {
    if (size < 5) {
        // small lists are generated using a regular sorted set (TreeSet)
        TreeSet<Long> set = new TreeSet<Long>();
        while (set.size() < size) {
            set.add(r.nextLong() & ((2L << shift) - 1));
        }
        return set.iterator();
    }
    // large lists are created recursively
    return new Iterator<Long>() {
        long remaining = size, zeros = randomHalf(r, size);
        Iterator<Long> lowBits0 = randomSequence(r, zeros, shift - 1);
        Iterator<Long> lowBits1;
        @Override
        public boolean hasNext() {
            return remaining > 0;
        }
        @Override
        public Long next() {
            remaining--;
            if (lowBits0.hasNext()) {
                return lowBits0.next();
            }
            if (lowBits1 == null) {
                lowBits1 = randomSequence(r, size - zeros, shift - 1);
            }
            return (1L << shift) + lowBits1.next();
        }
    };
}

/**
 * Get the number of entries that are supposed to be below the half,
 * according to the probability theory. For example, for a number of coin
 * flips, how many are heads.
 *
 * @param r the random generator
 * @param samples the total number of entries
 * @return the number of entries that should be used for one half
 */
static long randomHalf(Random r, long samples) {
    long low = 0, high = samples;
    double x = r.nextDouble();
    while (low + 1 < high) {
        long mid = (low + high) / 2;
        double p = probabilityBucketAtMost(samples, mid);
        if (x > p) {
            low = mid;
        } else {
            high = mid;
        }
    }
    return (low + high) / 2;
}

static double probabilityBucketAtMost(long flips, long heads) {
    // https://www.fourmilab.ch/rpkp/experiments/statistics.html
    long x = heads;
    long n = flips;
    // standard deviation of the number of heads: sqrt(n / 4)
    // (floating-point division, to avoid truncating n / 4)
    double stdDev = Math.sqrt(n / 4.0);
    // mean (also as a double, so odd n is not rounded down)
    double mu = n / 2.0;
    // https://en.wikipedia.org/wiki/Normal_distribution
    // Numerical approximations for the normal CDF
    // the probability that the value of a standard normal random variable X is <= x
    return phi((x - mu) / stdDev);
}

static double phi(double x) {
    return 0.5 * (1 + Math.signum(x) * Math.sqrt(1 - Math.exp(-2 * x * x / Math.PI)));
}
