How to improve the speed of merkle root calculation in C++?

Question

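I am trying to optimize my merkle root calculation as much as possible. So far I had implemented it in Python, which led to this question and the suggestion to rewrite it in C++.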

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>



std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
    unsigned char inp[64];
    int j=0;
    for (int i=0; i<32; i++)
    {
        inp[j] = a[i];
        j++;
    }
    for (int i=0; i<32; i++)
    {
        inp[j] = b[i];
        j++;
    }

    const EVP_MD *md_algo = EVP_sha256();
    unsigned int md_len = EVP_MD_size(md_algo);
    std::vector<unsigned char> out( md_len );
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
   std::vector<std::vector<unsigned char> > out;
   int len = inp_list.size();
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(
            double_sha256(inp_list[i], inp_list[i+1])
        );
   }
   if (len % 2 == 1)
   {
        out.push_back(
            double_sha256(inp_list[len-1], inp_list[len-1])
        );
   }
   return calculate_merkle_root(out);
}



int main()
{
    std::ifstream infile("txids.txt");

    std::vector<std::vector<unsigned char> > txids;
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        std::vector<unsigned char> buf2;
        for (int i=31; i>=0; i--)
        {
            buf2.push_back(
                buf[i]
            );
        }
        txids.push_back(
            buf2
        );
        count++;
    }
    infile.close();
    std::cout << count << std::endl;

    std::vector<std::vector<unsigned char> > merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    std::vector<unsigned char> out0 = merkle_root_hash[0];
    std::vector<unsigned char> out;
    for (int i=31; i>=0; i--)
    {
        out.push_back(
            out0[i]
        );
    }

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

However, the performance is worse than that of the Python implementation (~4s):

$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out 
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e

real      0m9.245s
user      0m9.235s
sys       0m0.008s

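The complete implementation and the input file are available here: test.cpp and txids.txt.

How can I improve the performance? Are compiler optimizations enabled by default? Is there a faster sha256 library available than openssl?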

Answer 1

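There are plenty of things you can do to optimize the code.

Here is the list of the important points:

compiler optimizations need to be enabled (GCC does not optimize by default, so pass -O3);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is 32), and a new Hash type can even be defined for clarity;
parameters should be passed by reference (C++ passes parameters by copy by default);
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the memory allocated by OPENSSL_hexstr2buf;
push_back should be avoided when the size is a constant known at compile time;
using std::copy is often faster (and cleaner) than a manual copy;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash is supposed to be 32, but assertions can be used to check that this holds;
count is not needed, as it is the size of the txids vector;

Here is the resulting code: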

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

using Hash = std::array<unsigned char, 32>;

Hash double_sha256(const Hash& a, const Hash& b)
{
    assert(a.size() == 32 && b.size() == 32);

    unsigned char inp[64];
    std::copy(a.begin(), a.end(), inp);
    std::copy(b.begin(), b.end(), inp+32);

    const EVP_MD *md_algo = EVP_sha256();
    assert(EVP_MD_size(md_algo) == 32);

    unsigned int md_len = 32;
    Hash out;
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
   std::vector<Hash> out;
   int len = inp_list.size();
   out.reserve(len/2+2);
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
   }
   if (len % 2 == 1)
   {
        out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
   }
   return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");

    std::vector<Hash> txids;
    std::string line;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        Hash buf2;
        std::copy(buf, buf+32, buf2.begin());
        std::reverse(buf2.begin(), buf2.end());
        txids.push_back(buf2);
        OPENSSL_free(buf);
    }
    infile.close();
    std::cout << txids.size() << std::endl;

    std::vector<Hash> merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    Hash out0 = merkle_root_hash[0];
    Hash out = out0;
    std::reverse(out.begin(), out.end());

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

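On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.

This implementation spends >98% of its time in EVP_Digest. So if you want faster code, you can try to find a faster hashing library, although OpenSSL should already be pretty fast. The current code already manages to compute 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can also parallelize the program using OpenMP (this is roughly 5 times faster on my 6-core machine).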
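A minimal sketch of how the OpenMP parallelization mentioned above might look; it is not the code that was actually benchmarked. It replaces the recursion with a loop over tree levels and hashes all pairs of one level in parallel. The name merkle_root_parallel and the combine parameter are hypothetical; combine stands for any function with the signature of double_sha256 above. The sketch assumes that the hashing function may be called from several threads at once, and that the program is compiled with -fopenmp (without that flag the pragma is ignored and the loop runs sequentially).

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

using Hash = std::array<unsigned char, 32>;

// Reduces `level` to a single root hash. `combine` is any callable taking
// (const Hash&, const Hash&) and returning a Hash, e.g. double_sha256.
template <class Combine>
Hash merkle_root_parallel(std::vector<Hash> level, Combine combine)
{
    if (level.empty())
        return Hash{};  // assumption of this sketch: empty input gives an all-zero hash

    std::vector<Hash> next;
    while (level.size() > 1)
    {
        const long n = static_cast<long>(level.size());
        const long pairs = (n + 1) / 2;
        next.resize(pairs);

        // Every pair is independent of the others, so the iterations of this
        // loop can be distributed over threads. Output goes to a separate
        // buffer so that no thread overwrites data another one still reads.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < pairs; i++)
        {
            const Hash& left = level[2 * i];
            // An unpaired last element is combined with itself.
            const Hash& right = (2 * i + 1 < n) ? level[2 * i + 1] : left;
            next[i] = combine(left, right);
        }
        level.swap(next);  // the freshly computed level becomes the input
    }
    return level[0];
}
```

It could be called as merkle_root_parallel(txids, double_sha256). Only the first few levels contain enough pairs to keep all cores busy, so the achievable speedup depends on the number of leaves.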

Answer 2

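I decided to implement the Merkle root and SHA-256 computation from scratch: a full SHA-256 implementation that uses the SIMD (Single Instruction Multiple Data) approach, known from the SSE2, AVX2 and AVX512 instruction sets.

My code below is, in the AVX2 case, 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.

Here I provide the C++ implementation. I also made a Python implementation with the same speed (because its core uses the C++ code); for the Python implementation see the related post. The Python implementation is definitely easier to use than the C++ one.

My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class for abstracting any SIMD operations, plus many tests.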
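To illustrate the general idea of a class that abstracts SIMD operations, here is a small sketch; it is not taken from sha256_simd.cpp, and the names Lanes, rotr, shr and small_sigma0 are hypothetical. Each lane holds the same 32-bit word of a different, independent hash, and every operation is applied to all lanes at once. Pairs within one Merkle tree level are independent, which is what makes this layout applicable. Only one building block of SHA-256 is shown, the message-schedule function sigma0; the plain loops are written so that a compiler can turn them into SSE2/AVX2/AVX512 instructions, whereas a real implementation would more likely use intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// N independent 32-bit words, one per hash being computed side by side.
template <std::size_t N>
struct Lanes
{
    std::array<std::uint32_t, N> v;
};

// Rotates every lane right by r bits (0 < r < 32).
template <std::size_t N>
Lanes<N> rotr(const Lanes<N>& x, unsigned r)
{
    Lanes<N> out;
    for (std::size_t i = 0; i < N; i++)
        out.v[i] = (x.v[i] >> r) | (x.v[i] << (32 - r));
    return out;
}

// Shifts every lane right by r bits.
template <std::size_t N>
Lanes<N> shr(const Lanes<N>& x, unsigned r)
{
    Lanes<N> out;
    for (std::size_t i = 0; i < N; i++)
        out.v[i] = x.v[i] >> r;
    return out;
}

// Lane-wise XOR.
template <std::size_t N>
Lanes<N> operator^(const Lanes<N>& a, const Lanes<N>& b)
{
    Lanes<N> out;
    for (std::size_t i = 0; i < N; i++)
        out.v[i] = a.v[i] ^ b.v[i];
    return out;
}

// SHA-256 sigma0 (ROTR 7 xor ROTR 18 xor SHR 3) for all lanes at once.
template <std::size_t N>
Lanes<N> small_sigma0(const Lanes<N>& x)
{
    return rotr(x, 7) ^ rotr(x, 18) ^ shr(x, 3);
}

// Scalar reference version of the same function, for comparison.
inline std::uint32_t small_sigma0_scalar(std::uint32_t x)
{
    return ((x >> 7) | (x << 25)) ^ ((x >> 18) | (x << 14)) ^ (x >> 3);
}
```

With N = 8 one Lanes value fills a 256-bit AVX2 register, so the same number of instructions advances eight hashes instead of one.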

First I provide timings, made on Google Colab, since it has a fairly advanced AVX2 processor there:

MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms

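Ossl tests the OpenSSL implementation, the rest are my implementations. AVX512 gives an even bigger speed improvement; it is not tested here because Colab has no AVX512 support. The actual improvement in speed depends on the processor's capabilities.

Compilation was tested on both Windows (MSVC) and Linux (CLang), using the following commands:

Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib, where you provide the directories of your OpenSSL installation. If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. SSE2 or AVX512 can also be used here instead of AVX2. OpenSSL for Windows can be downloaded from here.
Linux: CLang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed, and through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe if it is not. As you can see, the most recent clang-12 is used; to install it, run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best SIMD instruction set.

My code needs C++20 standard support, because it uses some advanced features to implement everything more easily.

I implemented OpenSSL support in my library only to compare timings and show that my AVX2 version is 3-3.5x faster.

I also provide timings done on GodBolt, but these serve only as an example of AVX-512 usage, because GodBolt CPUs have advanced AVX-512. Don't use GodBolt to actually measure timings, because all timings there jump up and down by a factor of up to 5, seemingly because the operating system preempts the active process. A GodBolt link for playground is also provided (this link may have somewhat outdated code; use the latest link to the code at the bottom of my post):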

MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms
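When timings fluctuate as much as described above, one common remedy is to repeat the measurement and keep the fastest run. The following is a minimal sketch of such a helper; best_time_ms is a hypothetical name and not a function from sha256_simd.cpp. Taking the minimum may reduce the effect of interruptions by other processes, but it cannot remove it on a heavily shared machine.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Runs `work` `repeats` times and returns the duration of the fastest run
// in milliseconds. The fastest run is the one least disturbed by other
// processes, so it is usually the most reproducible figure.
template <class Work>
double best_time_ms(Work work, int repeats)
{
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; r++)
    {
        // steady_clock is monotonic, so system clock adjustments cannot skew the result.
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        best = std::min(best, ms);
    }
    return best;
}
```

A call could look like best_time_ms([&] { root = calculate_merkle_root(txids); }, 10), where the result is stored in a variable that is used afterwards so that the compiler cannot optimize the computation away.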

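Examples of using my code can be seen in the Test() function, which tests all functionality of my library. My code is a bit dirty because I didn't want to spend too much time on creating a beautiful library, but rather just wanted to prove that a SIMD-based implementation can be considerably faster than the OpenSSL version.

If you really want to use my SIMD-boosted version instead of OpenSSL, care a lot about speed, and have questions about how to use it, please ask me in the comments or in chat.

Also, I didn't bother to implement a multi-core/multi-threaded version. I think it is obvious how to do that, and you can and should implement it without difficulty.

I am providing an external link to the code below, because the code is around 51 KB in size, which exceeds the 30 KB of text allowed for a StackOverflow post.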

sha256_simd.cpp
