C++ 在单个线程上调用 LAPACKE 运行而 NumPy 使用所有线程

Question

我写了一个 C++ 代码，它的瓶颈是一个可能很大的对称矩阵的对角化。该代码使用 OpenMP、CBLAS 和 LAPACKE C 接口。

但是，对 dsyev 的调用在我的本地计算机和 HPC 集群上都在单个线程上运行（如 htop 或等效工具所见）。对角化 12000x12000 矩阵大约需要 800 秒，而 NumPy 的函数 eigh 大约需要 250 秒。当然，在这两种情况下，$OMP_NUM_THREADS 都设置为线程数。

这是调用 LAPACK 的 C++ 代码示例，这基本上就是我在程序中所做的。（我正在读取二进制格式的矩阵）。

#include <stdlib.h>
#include <cstring>
#include <iostream>
#include <fstream>
#include <sstream>
#include <array>
#include <chrono>
#include <omp.h>
#include <cblas.h>
extern "C"
{
    #include <lapacke.h>
}

using namespace std;

const char uplo = 'L';
const char jobz = 'V';


int eigen(double* M, double* W, const int& N)
{

    printf("Diagonalizing...\n");

    int info;

    // Eigenvalues will be in W, Eigenvectors in T
    info = LAPACKE_dsyev(LAPACK_COL_MAJOR, jobz, uplo, N, M, N, W);

    if (info < 0)
    {
        printf("FATAL ERROR: Eigensolver (dsyev) failed: illegal argument %d\n",-info);
    }
    else if (info > 0)
    {
        printf("FATAL ERROR: Eigensolver (dsyev) failed: failed to converge %d\n",info);
    }

    return info;
}


int main(int argc, char* argv[])
{

    #ifdef _OPENMP
    if (getenv("OMP_NUM_THREADS"))
    {
        omp_set_num_threads(atoi(getenv("OMP_NUM_THREADS")));
        string str = "Running on "+to_string(omp_get_max_threads()) + " threads\n";
        printf(str.c_str());
    }
    #endif

    const int N = 12000;
    double t1 = 0.;
    double t2 = 0.;
    double* M = new double[N*N]{0.};
    double* W = new double[N]{0.};
    int info = 0;


    printf("Reading matrix.dat ...\n");
    ifstream file("matrix.dat");
    file.seekg(0, ios_base::end);
    size_t size = file.tellg();
    file.seekg(0, ios_base::beg);
    file.read((char*) &M[0], size);
    printf("Done.\n");




    t1 = omp_get_wtime();

    info = eigen(M,W,N);
    printf("Exit code: %i\n",info);

    t2 = omp_get_wtime();
    printf("Done in (s): %-10.3f\n",t2-t1);

    delete [] M;
    delete [] W;

    return 0;

}

很明显，NumPy 似乎链接到 LAPACK 的一个多线程版本，而我的程序不是。我认为任何 LAPACK 实现都是多线程的，因为它调用 BLAS，它（应该）总是多线程的。

在我的本地机器上，np.show_config() 给出以下内容：

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/miniconda3/include']

虽然我将我的 C++ 代码编译到位于 /usr/local/opt/lapack/lib 和 /usr/local/opt/openblas/lib 等的 LAPACK 和 BLAS 库中...

如何解决这个问题，即如何确保 LAPACK 使用所有线程？

Answer 1

根据提供的信息，您的 C++ 代码似乎是 OpenBLAS 编辑的，而您的 Python 实现使用 Intel MKL.

OpenBLAS 是一个免费的开源库，它实现了基本的线性代数函数（称为 BLAS，如矩阵乘法、点产品等），但它几乎不支持高级线性代数函数（称为 LAPACK，如特征值、QR 分解等）。因此，虽然 OpenBLAS 的 BLAS 功能得到了很好的优化并且运行并行。 LAPACK 函数显然还没有得到很好的优化，并且大部分运行是连续的。

Intel MKL 是一个非自由的闭源库，同时实现了 BLAS 和 LAPACK 功能。 Intel 声称在执行 BLAS 和 LAPACK 函数方面具有高性能（至少在 Intel 处理器上）。实施得到了很好的优化，大多数是运行并行。

因此，如果您希望 C++ 代码至少与 Python 代码一样快，则需要 link MKL 而不是 OpenBLAS.

C++ 在单个线程上调用 LAPACKE 运行而 NumPy 使用所有线程

C++ call to LAPACKE run on a single thread while NumPy uses all threads

c++

parallel-processing

performance

numpy

openmp

C++ 在单个线程上调用 LAPACKE 运行 而 NumPy 使用所有线程

C++ call to LAPACKE run on a single thread while NumPy uses all threads

c++

parallel-processing

performance

numpy

openmp

C++ 在单个线程上调用 LAPACKE 运行而 NumPy 使用所有线程