C++ random generator with provided (at least estimated) entropy

Question

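Using the C++ standard random generators, I can more or less efficiently create sequences with predefined distributions using the tools the language provides. What about Shannon entropy? Is there some way to specify the output Shannon entropy of the generated sequence?

I tried a small experiment: I generated a sufficiently long sequence with a linear (uniform) distribution and implemented a Shannon entropy calculator. The resulting value ranges from 0.0 (absolute order) to 8.0 (absolute chaos).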

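#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>

// Shannon entropy, in bits per symbol, of a range of symbol probabilities.
template <typename T>
double shannon_entropy(T first, T last)
{
    size_t frequencies_count{};
    double entropy = 0.0;

    std::for_each(first, last, [&entropy, &frequencies_count](auto item) {
        if (0. == item) return; // p * log2(p) tends to 0 as p tends to 0
        double fp_item = static_cast<double>(item);
        entropy += fp_item * std::log2(fp_item);
        ++frequencies_count;
    });

    // More than 256 non-zero probabilities cannot describe a byte alphabet.
    if (frequencies_count > 256) {
        return -1.0;
    }

    return -entropy;
}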

std::vector<uint8_t> generate_random_sequence(size_t sequence_size)
{
    std::vector<uint8_t> random_sequence;
    std::random_device rnd_device;

    std::cout << "Random device entropy: " << rnd_device.entropy() << '\n';

    std::mt19937 mersenne_engine(rnd_device());
    std::uniform_int_distribution<unsigned> dist(0, 255);

    auto gen = std::bind(dist, mersenne_engine);
    random_sequence.resize(sequence_size);
    std::generate(random_sequence.begin(), random_sequence.end(), gen);
    return random_sequence; // returning by name lets the compiler move or elide
}

std::vector<double> read_random_probabilities(size_t sequence_size)
{
    std::vector<size_t> bytes_distribution(256);
    std::vector<double> bytes_frequencies(256);

    std::vector<uint8_t> random_sequence = generate_random_sequence(sequence_size);

    size_t rnd_seq_size = random_sequence.size();
    std::for_each(random_sequence.begin(), random_sequence.end(), [&](uint8_t b) mutable {
        ++bytes_distribution[b];
    });

    std::transform(bytes_distribution.begin(), bytes_distribution.end(), bytes_frequencies.begin(),
        [&rnd_seq_size](size_t item) {
        return static_cast<double>(item) / rnd_seq_size;
    });
    return bytes_frequencies; // returning by name lets the compiler move or elide
}

int main(int argc, char* argv[]) {

    size_t sequence_size = 1024 * 1024;
    std::vector<double> bytes_frequencies = read_random_probabilities(sequence_size);
    double entropy = shannon_entropy(bytes_frequencies.begin(), bytes_frequencies.end());

    std::cout << "Sequence entropy: " << std::setprecision(16) << entropy << std::endl;

    std::cout << "Min possible file size assuming max theoretical compression efficiency:\n";
    std::cout << (entropy * sequence_size) << " in bits\n";
    std::cout << ((entropy * sequence_size) / 8) << " in bytes\n";

    return EXIT_SUCCESS;
}

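First, it appears that std::random_device::entropy() is hardcoded to return 32; in MSVC 2015 (which is probably 8.0 according to Shannon's definition). As you can check for yourself, that is not far from the truth: in this example the result is always close to 7.9998..., i.e. absolute chaos.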
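As a rough cross-check of the 7.9998... figure, the standard first-order bias of the plug-in (histogram) entropy estimate can be worked out by hand. This is a sketch that assumes the bytes really are i.i.d. uniform; the symbols K (alphabet size), N (sample size), n_i (count of byte i) and \hat{H} (the estimate) are introduced here for illustration and do not appear in the code above.

```latex
\hat{H} = -\sum_{i=0}^{K-1} \frac{n_i}{N}\,\log_2\frac{n_i}{N},
\qquad
\mathbb{E}\big[\hat{H}\big] \approx \log_2 K - \frac{K-1}{2N\ln 2}

K = 256,\; N = 2^{20}:\quad
\mathbb{E}\big[\hat{H}\big] \approx 8 - \frac{255}{2\cdot 1048576\cdot 0.6931}
\approx 8 - 0.000175 \approx 7.99982
```

Under that assumption, a value just below 8.0 is what a finite 1 MiB sample is expected to give, so the small shortfall says more about the sample size than about the generator.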

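A working example is on IDEONE (by the way, their compiler hardcodes the entropy to 0).

And one more question, the main one: is it possible to create a generator that produces a linearly distributed sequence with a defined entropy, say 6.0 to 7.0? Can it be implemented at all, and if so, are there any implementations?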

Answer 1

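I can't comment yet, but I would like to start the discussion: from the point of view of communication/information theory, it seems you need probabilistic shaping methods to achieve what you want. You should be able to feed the output of any distribution function through a shaping coder, which should then redistribute the input to a specific target Shannon entropy. Probabilistic constellation shaping has been applied successfully in fiber-optic communication: Wikipedia with some other links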
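To make the idea of a target entropy concrete, here is a minimal sketch. It is not a shaping coder or distribution matcher of the kind used in constellation shaping; it only samples bytes directly from a tilted distribution whose entropy has been tuned by bisection. The family p_i proportional to exp(-lambda * i) and all function names are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Entropy in bits per symbol of a probability vector.
double entropy_bits(const std::vector<double>& p) {
    double h = 0.0;
    for (double x : p)
        if (x > 0.0) h -= x * std::log2(x);
    return h;
}

// Hypothetical shaping family: p_i proportional to exp(-lambda * i), i = 0..255.
std::vector<double> shaped_probabilities(double lambda) {
    std::vector<double> p(256);
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i] = std::exp(-lambda * static_cast<double>(i));
        sum += p[i];
    }
    for (double& x : p) x /= sum;
    return p;
}

// Bisection on lambda: the entropy falls from 8 bits at lambda = 0 toward 0.
std::vector<double> probabilities_for_entropy(double target_bits) {
    double lo = 0.0, hi = 50.0;
    for (int iter = 0; iter < 100; ++iter) {
        const double mid = 0.5 * (lo + hi);
        if (entropy_bits(shaped_probabilities(mid)) > target_bits) lo = mid;
        else hi = mid;
    }
    return shaped_probabilities(0.5 * (lo + hi));
}

// Draws n bytes whose per-symbol distribution has roughly the target entropy.
std::vector<std::uint8_t> generate_with_entropy(double target_bits, std::size_t n,
                                                std::mt19937& rng) {
    const std::vector<double> p = probabilities_for_entropy(target_bits);
    std::discrete_distribution<int> dist(p.begin(), p.end());
    std::vector<std::uint8_t> out(n);
    for (auto& b : out) b = static_cast<std::uint8_t>(dist(rng));
    return out;
}
```

Calling generate_with_entropy(6.5, 1024 * 1024, rng) with a seeded std::mt19937 should give a sequence whose histogram entropy is close to 6.5 bits per byte. The sequence is deliberately non-uniform, which is the only way a memoryless byte source can have less than 8 bits of entropy.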

Answer 2

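It's not clear what you want to achieve, and there are several ways of lowering the Shannon entropy of your sequence:

Correlation between the bits, e.g. passing random_sequence through a simple filter.
Individual bits that are not fully random.

As an example of the latter, you could make the bytes less random as shown below: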

std::vector<uint8_t> generate_random_sequence(size_t sequence_size,
                                              uint8_t cutoff = 10)
{
    std::vector<uint8_t> random_sequence;
    std::vector<uint8_t> other_sequence;
    std::random_device rnd_device;

    std::cout << "Random device entropy: " << rnd_device.entropy() << '\n';

    std::mt19937 mersenne_engine(rnd_device());
    std::uniform_int_distribution<unsigned> dist(0, 255);

    // Capture by reference so both std::generate calls advance the same engine;
    // std::generate copies its generator, so a std::bind object holding its own
    // engine copy would produce the same values twice.
    auto gen = [&] { return dist(mersenne_engine); };
    random_sequence.resize(sequence_size);
    std::generate(random_sequence.begin(), random_sequence.end(), gen);
    other_sequence.resize(sequence_size);
    std::generate(other_sequence.begin(), other_sequence.end(), gen);
    for (size_t j = 0; j < sequence_size; ++j) {
        // With probability (cutoff + 1) / 256, overwrite the byte with a fixed value.
        if (other_sequence[j] <= cutoff) random_sequence[j] = 0; // Or j or ...
    }
    return random_sequence;
}
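The effect of cutoff on the per-byte entropy can be computed in closed form, which gives a way to pick a cutoff for a target such as 6.5 bits. The sketch below assumes the two sequences are independent and uniform, so that a byte is forced to 0 with probability (cutoff + 1) / 256; the helper names are hypothetical and not part of the code above.

```cpp
#include <cmath>

// Per-byte entropy (bits) of the cutoff scheme: a uniform byte is replaced by 0
// with probability q = (cutoff + 1) / 256, independently of its own value.
double cutoff_entropy_bits(unsigned cutoff) {
    const double q = (static_cast<double>(cutoff) + 1.0) / 256.0;
    const double p_other = (1.0 - q) / 256.0; // each of the 255 non-zero values
    const double p_zero = q + p_other;        // forced zeros plus natural zeros
    double h = -p_zero * std::log2(p_zero);
    if (p_other > 0.0) h -= 255.0 * p_other * std::log2(p_other);
    return h;
}

// Smallest cutoff whose entropy does not exceed the target, or 255 if none does.
unsigned cutoff_for_entropy(double target_bits) {
    for (unsigned c = 0; c < 256; ++c)
        if (cutoff_entropy_bits(c) <= target_bits) return c;
    return 255;
}
```

Because cutoff is an integer, only 256 entropy levels are reachable this way, so a target is met approximately rather than exactly.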

I don't think this is the answer you were looking for, so you may need to clarify the question further.

Answer 3

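First, you're viewing Shannon's theory entirely incorrectly. His argument (as you're using it) is simply: "given the probability of x (Pr(x)), the number of bits required to store x is -log2 Pr(x)." It has nothing to do with the probability of x. In this regard, you're viewing Pr(x) wrong. -log2 Pr(x), given a Pr(x) that should be uniformly 1/256, results in a required bit width of 8 bits to store. However, that's not how statistics work. Go back to thinking about Pr(x), because the bits required mean nothing.

Your question is about statistics. Given an infinite sample, if and only if the distribution matches the ideal histogram, the probability of each sample will approach the expected frequency as the sample size approaches infinity. I want to make it clear that you're not looking for "-log2 Pr(x) is absolute chaos when it's 8, given Pr(x) = 1/256". A uniform distribution is not chaos. In fact, it's... well, uniform. Its properties are well known, simple, and easy to predict. You're looking for: "Is the finite sample set S meeting the criteria of an independently distributed uniform distribution (commonly known as "Independently and Identically Distributed Data" or "i.i.d.") of Pr(x) = 1/256?" This has nothing to do with Shannon's theory and goes much further back in time to the basic probability theories involving flips of a coin (in this case binomial, given the assumed uniformity).

Assuming for a moment that any C++11 <random> generator meets the criteria for "statistically indistinguishable from i.i.d." (which, by the way, those generators don't), you can use them to emulate i.i.d. results. If you would like a range of data that is storable within 6..7 bits (it wasn't clear whether you meant 6 or 7 bits, because hypothetically everything in between is doable as well), simply scale the range. For example...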

#include <iostream>
#include <random>

int main() {
    unsigned long low = 1 << 6; // 2^6 == 64
    unsigned long limit = 1 << 7; // 2^7 == 128
    // Therefore, the range is 6-bits to 7-bits (or 64 + [128 - 64])
    unsigned long range = limit - low;
    std::random_device rd;
    std::mt19937 rng(rd()); //<< Doesn't actually meet criteria for i.i.d.
    std::uniform_int_distribution<unsigned long> dist(low, limit - 1); //<< Given an engine that actually produces i.i.d. data, this would produce exactly what you're looking for
    for (int i = 0; i != 10; ++i) {
        unsigned long y = dist(rng);
        //y is known to be in set {2^6..2^7-1} and assumed to be uniform (coin flip) over {low..low + (range-1)}.
        std::cout << y << std::endl;
    }
    return 0;
}
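The "scale the range" idea can be tied back to entropy with one line of arithmetic. As a sketch, assuming an ideal i.i.d. uniform source over M distinct values (M is a symbol introduced here for illustration):

```latex
H = -\sum_{k=1}^{M} \frac{1}{M}\log_2\frac{1}{M} = \log_2 M

M = 64 \Rightarrow H = 6, \qquad
M = 91 \Rightarrow H \approx 6.51, \qquad
M = 128 \Rightarrow H = 7
```

The program above draws from the 64 values 64..127, so under that assumption its entropy is 6 bits per sample; a range with between 64 and 128 distinct values would land between 6 and 7 bits.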

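The problem is that while the <random> distribution classes are accurate, the random number generators (presumably aside from std::random_device, but that's system-specific) are not designed to stand up to statistical tests of fitness as i.i.d. generators.

If you would like one that does, implement a CSPRNG (my go-to is Bob Jenkins's ISAAC) that has an interface meeting the requirements of the <random> class of generators (probably just covering the basic interface of std::random_device is good enough).

Testing for a statistically sound "no" or "we can't say no" as to whether a set follows a specific model (so that Pr(x) is accurate, and therefore Shannon's entropy function is an accurate prediction) is another matter entirely. As I said, no generator in <random> meets these criteria (except maybe std::random_device). My advice is to research things like the central limit theorem, goodness-of-fit, birthday spacing, et cetera.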

To drive my point home a bit further: under the assumptions of your question...

struct uniform_rng {
    unsigned long x;
    constexpr uniform_rng(unsigned long seed = 0) noexcept:
        x{ seed }
    { };

    unsigned long operator ()() noexcept {
        unsigned long y = this->x++;
        return y;
    }
};

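...absolutely meets your criteria for being uniform (or, as you say, "absolute chaos"). Pr(x) is most certainly 1/N, and the number of bits required to store any number of the set is -log2(1/N), which is the bit width of unsigned long when N is 2 to the power of that bit width. However, it is not independently distributed. Because we know its properties, you can "store" its entire sequence by simply storing the seed. Surprise: all PRNGs work this way. Therefore, the number of bits required to store the entire sequence of a PRNG is -log2(1/2^bitsForSeed). As your sample grows, the ratio of the bits required for storage to the bits you are able to generate for that sample (also known as the compression ratio) approaches a limit of 0.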
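The seed argument can be illustrated with std::mt19937. This sketch assumes a 32-bit seed and 32-bit outputs, and the helper names are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Regenerates the first n outputs of std::mt19937 from nothing but its seed.
std::vector<std::uint32_t> expand_seed(std::uint32_t seed, std::size_t n) {
    std::mt19937 rng(seed);
    std::vector<std::uint32_t> out(n);
    for (auto& v : out) v = static_cast<std::uint32_t>(rng());
    return out;
}

// Bits needed to store the sequence (one 32-bit seed) divided by the bits
// generated (n outputs of 32 bits each).
double seed_to_output_ratio(std::size_t n) {
    return 32.0 / (32.0 * static_cast<double>(n));
}
```

Two calls to expand_seed with the same seed return identical vectors, so the seed alone is enough to "store" the sequence, and seed_to_output_ratio(n) shrinks toward 0 as n grows, whatever the histogram entropy of the outputs.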
