使用英特尔 oneAPI MKL 执行具有密集向量乘法的稀疏矩阵

Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication

我正在开发一个程序,大量使用 Armadillo 库。我有 10.8.2 版本,链接到 Intel oneAPI MKL 2022.0.2。在某些时候,我需要执行许多 稀疏矩阵乘以密集向量乘法,两者都是使用犰狳结构定义的。我发现这一点可能是一个瓶颈,并且很好奇是否用 MKL 中的“基本”稀疏 CBLAS 例程替换 Armadillo 乘法(mkl_sparse_d_mv) would speed things up. But in order to do so, I need to convert from Armadillo's SpMat to something that MKL understands. As per Armadillo docs, sparse matrices are stored in CSC 格式,所以我尝试过 mkl_sparse_d_create_csc。我的尝试如下:

#include <iostream>
#include <armadillo>
#include "mkl.h"

int main()
{

    arma::umat locations = {{0, 0, 1, 3, 2},{0, 1, 0, 2, 3}};
    // arma::vec vals = {0.5, 2.5, 2.5, 4.5, 4.5};
    arma::vec vals = {0.5, 2.5, 3.5, 4.5, 5.5};
    arma::sp_mat X(locations, vals);
    std::cout << "X = \n" << arma::mat(X) << std::endl;
    arma::vec v = {1,1,1,1};
    arma::vec v2;
    v2.resize(4);

    std::cout << "v = \n" << v << std::endl;
    std::cout << "X * v = \n" << X * v << std::endl;

    MKL_INT *cols_beg = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *cols_end = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *row_idx = static_cast<MKL_INT *>(mkl_malloc(X.n_nonzero * sizeof(MKL_INT), 64));
    double *values = static_cast<double *>(mkl_malloc(X.n_nonzero * sizeof(double), 64));

    for (MKL_INT i = 0; i < X.n_cols; i++)
    {
        cols_beg[i] = static_cast<MKL_INT>(X.col_ptrs[i]);
        cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());
        std::cout << cols_beg[i] << " --- " << cols_end[i] << std::endl;
    }
    std::cout << std::endl;
    for (MKL_INT i = 0; i < X.n_nonzero; i++)
    {
        row_idx[i] = static_cast<MKL_INT>(X.row_indices[i]);
        values[i] = X.values[i];
        std::cout << row_idx[i] << " --- " << values[i] << std::endl;
    }
    std::cout << std::endl;
    sparse_matrix_t X_mkl = NULL;
    sparse_status_t res = mkl_sparse_d_create_csc(&X_mkl, SPARSE_INDEX_BASE_ZERO,
                                                  X.n_rows, X.n_cols, cols_beg, cols_end, row_idx, values);

    if(res == SPARSE_STATUS_SUCCESS) std::cout << "Constructed mkl representation of X" << std::endl;
    matrix_descr dsc;
    dsc.type = SPARSE_MATRIX_TYPE_GENERAL;

    sparse_status_t stat =  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, X_mkl, dsc, v.memptr(), 0.0, v2.memptr());
    std::cout << "Multiplication status = " << stat << std::endl;
    if(stat == SPARSE_STATUS_SUCCESS) 
    {
        std::cout << "Calculated X*v via mkl" << std::endl;
        std::cout << v2;
    }

    mkl_free(cols_beg);
    mkl_free(cols_end);
    mkl_free(row_idx);
    mkl_free(values);
    mkl_sparse_destroy(X_mkl);

    return 0;
}

我正在编译此代码(在 Link Line Advisor 的帮助下) icpc -g testing.cpp -o intel_testing.out -DARMA_ALLOW_FAKE_GCC -O3 -xhost -Wall -Wextra -L${MKLROOT}/lib/intel64 -liomp5 -lpthread -lm -DMKL_ILP64 -qmkl=parallel -larmadillo 在 Pop!_OS 21.10 上。 它编译和运行没有任何问题。输出结果如下:

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 1
2 --- 2
3 --- 3
4 --- 4

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   0.5000
        0
        0
        0

我们可以看到,Armadillo 的乘法结果是正确的,而 MKL 的结果是错误的。我的问题是:我是不是哪里弄错了?还是MKL有问题?。我当然怀疑前者,但是在花费了大量时间之后,什么也找不到。任何帮助将不胜感激!

编辑

根据 CJR 和 Vidyalatha_Intel 的建议,我已将 col_end 更改为

 cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

现在的结果是

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 2
2 --- 3
3 --- 4
4 --- 5

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   4.0000
   2.5000
        0
        0

col_end确实是提示的2,3,4,5,结果还是错误

是的,正如 CJR 所指出的,cols_end 数组不正确。它们应该被索引为 2,3,4,5。请参阅有关函数参数的文档 mkl_sparse_d_create_csc

cols_end:

This array contains col indices, such that cols_end[i] - ind - 1 is the last index of col i in the arrays values and row_indx. ind takes 0 for zero-based indexing and 1 for one-based indexing.

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/inspector-executor-sparse-blas-routines/matrix-manipulation-routines/mkl-sparse-create-csc.html

更改此行

cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());

cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

现在重新编译 运行 代码。我已经对其进行了测试,它显示了正确的结果。带有结果和编译命令的图像