Threaded Matrix Multiplication

I'm working on a threaded implementation of matrix multiplication to use with my custom Matrix class, and I'm running into some issues with speedup.

The setup and the call to the operation look like this:

Matrix<double> left(2,2);
left[0][0] = 1; left[0][1] = 2;
left[1][0] = 3; left[1][1] = 4;

Matrix<double> right(2,2);
right[0][0] = 1; right[0][1] = 0;
right[1][0] = 0; right[1][1] = 1;

Matrix<double> result = left * right;

Here is a simplified version of my class with the implementation needed to see how the operation works. For completeness I've included my constructors, copy constructor, assignment operator, and destructor, though the last two functions are the ones I'm mainly concerned with. Also, I transpose the right multiplicand to take advantage of spatial locality, so each entry of the output matrix is the dot product of a row of the left multiplicand with a row of the transposed right multiplicand, rather than a row of the left multiplicand with a column of the right multiplicand. Here is the code:

#include <iostream>
#include <thread>
using namespace std;

template <class type>
class Matrix {

public:
    // Default Constructor
    Matrix() { this->create(); }
    // Custom Constructor
    Matrix(int r, int c) { this->create(r,c); }
    // Copy Constructor
    Matrix(const Matrix<type> &m) { this->copy(m); }
    // Assignment Operator
    Matrix<type>& operator=(const Matrix<type> &m);
    // Destructor
    ~Matrix() { this->destroy(); }

    // Accessor Functions
    int getRows() const { return rows; }
    int getCols() const { return cols; }
    type* operator[](int i) const { return contents[i]; }

    // Matrix Operations
    Matrix<type> transpose() const;

    // Matrix Multiplication
    friend Matrix<type> operator*(const Matrix<type> &m1, const Matrix<type> &m2)
        { Matrix<type> call; return call.multiply(m1,m2); }

private:
    // Private Member Functions
    void create();
    void create(int r, int c);
    void copy(const Matrix<type> &m);
    void destroy();

    // Operator Overloading Functions
    Matrix<type> multiply(const Matrix<type> &m1, const Matrix<type> &m2);

    // Private Member Variables
    int rows;
    int cols;
    type** contents;

};


// Default Constructor
template <class type>
void Matrix<type>::create() {
    rows = 0;
    cols = 0;
    contents = NULL;
}


// Custom Constructor
template <class type>
void Matrix<type>::create(int r, int c) {
    // Set Integer Values
    rows = r;
    cols = c;

    // Allocate Two-Dimensional Data
    contents = new type*[r];
    for (int i = 0; i < r; i++) {
        contents[i] = new type[c];
    }

}


// Copy Constructor
template <class type>
void Matrix<type>::copy(const Matrix &m) {
    // Create New Matrix From Existing One
    this->rows = m.getRows();
    this->cols = m.getCols();

    // Allocate Two-Dimensional Data
    this->contents = new type*[m.getRows()];
    for (int i = 0; i < m.getRows(); i++) {
        this->contents[i] = new type[m.getCols()];
    }

    // Copy Over Data
    for (int i = 0; i < m.getRows(); i++) {
        for (int j = 0; j < m.getCols(); j++) {
            (*this)[i][j] = m[i][j];
        }
    }

}


// Assignment Operator
template <class type>
Matrix<type>& Matrix<type>::operator=(const Matrix<type> &m) {
    // Overwrite Existing Matrix with Another
    // (Allowing for Self-Assignment) 
    if (this != &m) {
        this->destroy();
        this->copy(m);
    }

    return *this;

}


// Destructor
template <class type>
void Matrix<type>::destroy() {
    // Frees Allocated Memory
    for (int i = 0; i < rows; i++) {
        delete[] contents[i];
    }
    delete[] contents;

}


// Matrix Transpose
template <class type>
Matrix<type> Matrix<type>::transpose() const {

    Matrix<type> tran(cols,rows);

    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            tran[j][i] = contents[i][j];
        }
    }

    return tran;

}


// Threaded Matrix Multiplication
template <class type>
void matrixParallel(const Matrix<type>* a, const Matrix<type>* b, int numThreads, int currentThread, Matrix<type>* c) {

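    // Interleaved partition: this thread computes rows currentThread,
    // currentThread + numThreads, currentThread + 2*numThreads, ... of c.
    // b is the transposed right multiplicand, so each entry of c is the
    // dot product of a row of a with a row of b.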
    for (int i = currentThread; i < a->getRows(); i+=numThreads) {
        for (int j = 0; j < b->getRows(); j++) {
            type result = 0;
            for (int k = 0; k < a->getCols(); k++) {
                result += ((*a)[i][k]*(*b)[j][k]);
            }
            (*c)[i][j] = result;
        }
    }

}


// Matrix Multiplication
template <class type>
Matrix<type> Matrix<type>::multiply(const Matrix<type> &m1, const Matrix<type> &m2) {
    if (m1.getCols() != m2.getRows()) {
        cout << "Error: Cannot Multiply Matrices of Dimensions ";
        cout << "(" << m1.getRows() << "x" << m1.getCols() << ")*";
        cout << "(" << m2.getRows() << "x" << m2.getCols() << ")" << endl;
        cout << "        (Must be in the form (MxN)*(NxP)" << endl;
        return Matrix<type>();
    }


    // Parallel Method
    Matrix<type> m2t = m2.transpose();
    Matrix<type> multiply(m1.getRows(), m2.getCols());

    int numCPU = thread::hardware_concurrency();
    thread* threads = new thread[numCPU];

    const Matrix<type>* m1Pointer = &m1;
    const Matrix<type>* m2tPointer = &m2t;
    Matrix<type>* multiplyPointer = &multiply;

    for (int i = 0; i < numCPU; i++) {
        threads[i] = thread(matrixParallel<type>, m1Pointer, m2tPointer, numCPU, i, multiplyPointer);
    }

    for (int i = 0; i < numCPU; i++) {
        threads[i].join();
    }

    delete[] threads;

    return multiply;

}

(Note: compiled under Cygwin with the compiler option -std=c++11.)

My friend operator* implementation is a bit of a workaround. The two options I found for overloading the operator on templated classes were either a series of forward declarations (which I couldn't get to work, and whose format I don't much care for anyway) or implementing the whole operation inside the operator's first declaration. I ended up doing the latter in a cleaner way by using a dummy calling object that delegates to another member function, which performs the required operation while still being templated on the same type.
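For reference, the forward-declaration alternative mentioned above looks roughly like the following. This is only a minimal sketch of the pattern; the one-element Matrix and main() here are placeholders, not the class from the question:

#include <iostream>

// Forward declarations so the class can befriend one specialization of
// operator* rather than defining a brand-new function inline.
template <class type> class Matrix;
template <class type> Matrix<type> operator*(const Matrix<type> &m1,
                                             const Matrix<type> &m2);

template <class type>
class Matrix {
public:
    Matrix() : value() {}
    explicit Matrix(type v) : value(v) {}
    // The <> marks this friend as the specialization of the template declared above.
    friend Matrix<type> operator*<>(const Matrix<type> &m1,
                                    const Matrix<type> &m2);
private:
    type value;   // stand-in for the real contents
};

// Out-of-class definition; friendship grants access to private members.
template <class type>
Matrix<type> operator*(const Matrix<type> &m1, const Matrix<type> &m2) {
    return Matrix<type>(m1.value * m2.value);
}

int main() {
    Matrix<double> a(2.0), b(3.0);
    Matrix<double> c = a * b;
    (void)c;
    std::cout << "ok" << std::endl;
    return 0;
}

The empty angle brackets in the friend declaration name the specialization of the previously declared operator* template, so the out-of-class definition gets access to the private members.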

I don't understand why I'm not seeing close to linear speedup from this approach, since each thread does only a fraction of the work depending on the number of available cores. Here is some benchmark data from running this operation on matrices of various sizes:

/*
Single Thread
==========================================

5 Variables
------------------------------------------
100x5 * 5x5: 0.000157889 seconds
1000x5 * 5x5: 0.0010768 seconds
10000x5 * 5x5: 0.010099 seconds
100000x5 * 5x5: 0.112081 seconds
1000000x5 * 5x5: 1.04285 seconds

10 Variables
------------------------------------------
100x10 * 10x10: 0.000224202 seconds
1000x10 * 10x10: 0.00217571 seconds
10000x10 * 10x10: 0.0201944 seconds
100000x10 * 10x10: 0.203912 seconds
1000000x10 * 10x10: 2.04127 seconds

15 Variables
------------------------------------------
100x15 * 15x15: 0.000408143 seconds
1000x15 * 15x15: 0.00398906 seconds
10000x15 * 15x15: 0.0379782 seconds
100000x15 * 15x15: 0.381156 seconds
1000000x15 * 15x15: 3.81325 seconds

20 Variables
------------------------------------------
100x20 * 20x20: 0.000640239 seconds
1000x20 * 20x20: 0.00620069 seconds
10000x20 * 20x20: 0.060218 seconds
100000x20 * 20x20: 0.602554 seconds
1000000x20 * 20x20: 6.00925 seconds


2 Threads
==========================================

5 Variables
------------------------------------------
100x5 * 5x5: 0.000444063 seconds
1000x5 * 5x5: 0.00119759 seconds
10000x5 * 5x5: 0.00975319 seconds
100000x5 * 5x5: 0.09157 seconds
1000000x5 * 5x5: 0.965666 seconds

10 Variables
------------------------------------------
100x10 * 10x10: 0.000593268 seconds
1000x10 * 10x10: 0.00187927 seconds
10000x10 * 10x10: 0.0154861 seconds
100000x10 * 10x10: 0.161186 seconds
1000000x10 * 10x10: 1.5725 seconds

15 Variables
------------------------------------------
100x15 * 15x15: 0.000651292 seconds
1000x15 * 15x15: 0.00425471 seconds
10000x15 * 15x15: 0.0233983 seconds
100000x15 * 15x15: 0.232411 seconds
1000000x15 * 15x15: 2.43293 seconds

20 Variables
------------------------------------------
100x20 * 20x20: 0.000771287 seconds
1000x20 * 20x20: 0.0045547 seconds
10000x20 * 20x20: 0.0342536 seconds
100000x20 * 20x20: 0.381612 seconds
1000000x20 * 20x20: 3.79707 seconds


4 Threads
==========================================

5 Variables
------------------------------------------
100x5 * 5x5: 0.000690369 seconds
1000x5 * 5x5: 0.00120864 seconds
10000x5 * 5x5: 0.00994858 seconds
100000x5 * 5x5: 0.102673 seconds
1000000x5 * 5x5: 0.907731 seconds

10 Variables
------------------------------------------
100x10 * 10x10: 0.000896809 seconds
1000x10 * 10x10: 0.00287674 seconds
10000x10 * 10x10: 0.0177846 seconds
100000x10 * 10x10: 0.161331 seconds
1000000x10 * 10x10: 1.46384 seconds

15 Variables
------------------------------------------
100x15 * 15x15: 0.00100457 seconds
1000x15 * 15x15: 0.00366381 seconds
10000x15 * 15x15: 0.0291613 seconds
100000x15 * 15x15: 0.237525 seconds
1000000x15 * 15x15: 2.23676 seconds

20 Variables
------------------------------------------
100x20 * 20x20: 0.000928781 seconds
1000x20 * 20x20: 0.00486535 seconds
10000x20 * 20x20: 0.0421105 seconds
100000x20 * 20x20: 0.354478 seconds
1000000x20 * 20x20: 3.22576 seconds
*/
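The question doesn't show how these timings were gathered; a std::chrono harness along the following lines can produce such numbers. This is only a sketch: it assumes the Matrix class above, and the sizes and fill values are arbitrary examples.

#include <chrono>
#include <iostream>

// Sketch of a timing harness (not the one used for the numbers above).
// Assumes the Matrix class and operator* defined earlier in the question.
int main() {
    Matrix<double> left(100000, 20);   // example size only
    Matrix<double> right(20, 20);      // example size only

    // Fill with dummy values so the multiplication reads initialized data.
    for (int i = 0; i < left.getRows(); i++)
        for (int j = 0; j < left.getCols(); j++)
            left[i][j] = 1.0;
    for (int i = 0; i < right.getRows(); i++)
        for (int j = 0; j < right.getCols(); j++)
            right[i][j] = 1.0;

    auto start = std::chrono::steady_clock::now();
    Matrix<double> result = left * right;
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "100000x20 * 20x20: " << elapsed.count() << " seconds" << std::endl;
    std::cout << result[0][0] << std::endl;   // keep the result live
    return 0;
}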

Can anyone offer any insight into what might be causing this? Any comments would be greatly appreciated.

What you're seeing is how expensive threads are. That's why most modern frameworks don't even use raw threads; they use thread pools, a list of already-created threads waiting for work. Most also offer nominal support for fibers, as .NET does, but real support was never delivered because the thread-pool approach turned out to be good enough.
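To illustrate the idea, a minimal fixed-size thread pool might look like the sketch below. This is an assumption about how such reuse could be structured, not code from any particular framework: the workers are created once and then reused, so repeated multiplications only pay for enqueueing work, not for spawning threads.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool: worker threads are created once and
// then sleep until work is enqueued, avoiding per-call thread creation.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; i++) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return stop || !jobs.empty(); });
                        if (stop && jobs.empty()) return;
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job();   // run outside the lock
                }
            });
        }
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();
        for (auto &w : workers) w.join();
    }

    void enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push(std::move(job));
        }
        cv.notify_one();
    }

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()> > jobs;
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};

A real implementation would also need a way to wait for the enqueued row-chunks to finish (for example a pending-task counter with a condition variable, or std::future) before the multiplication returns its result.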

Combine that with the small amount of work each of your threads does, the context switching, and all the other boilerplate, and you simply won't see much benefit from your approach. In fact, most matrix libraries (Direct3D and the like) use SIMD (SSE2, if I remember correctly), which has no context-switch penalty and much less setup cost (just a couple of movqs).
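As a rough illustration of the SIMD point (a sketch with SSE2 intrinsics, not code taken from any matrix library), a dot product over doubles can process two elements per instruction:

#include <emmintrin.h>   // SSE2 intrinsics
#include <iostream>

// Dot product of two double arrays, two elements at a time with SSE2.
double dotSSE2(const double* a, const double* b, int n) {
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);        // load a[i], a[i+1]
        __m128d vb = _mm_loadu_pd(b + i);        // load b[i], b[i+1]
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; i++) sum += a[i] * b[i];       // scalar tail for odd n
    return sum;
}

int main() {
    double a[5] = {1, 2, 3, 4, 5};
    double b[5] = {1, 0, 0, 0, 1};
    std::cout << dotSSE2(a, b, 5) << std::endl;  // prints 6
    return 0;
}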

I suppose it's a cool learning exercise (even if there's no synchronization), but given the alternatives it's quite impractical overall.