Last threads execute slower than first threads while writing to an array

Question

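I'm trying to optimize a Mandelbrot set generator by making it multithreaded with the _beginthread() function. The computation evaluates a function over a 2D plane. I run about 8 threads at the same time, each computing one part (a band of rows) of the 2D array, but I noticed that the first threads to finish are much faster than the last ones. This is the output: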

Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
Ending thread   0 - Time taken: 1062ms
Ending thread   7 - Time taken: 1031ms
Ending thread   1 - Time taken: 1610ms
Ending thread   6 - Time taken: 1563ms
Ending thread   2 - Time taken: 10265ms
Ending thread   5 - Time taken: 10219ms
Ending thread   4 - Time taken: 31609ms
Ending thread   3 - Time taken: 31641ms

Every thread has the same thing to do, but with different numbers, so I don't understand why I get those times. This is how I multithreaded it:

#define HEIGHT 4000
#define WIDTH 4000
#define MAX_THREADS 8
int const maxIterations = 150;

int bitmap[HEIGHT][WIDTH];
bool finishedThreads[MAX_THREADS];

void renderRow(void * arg) {
    int startTime = GetTickCount();
    int * threadNumPinter = (int*)arg;
    int threadNum = *threadNumPinter;
    int startRow = threadNum * (HEIGHT / MAX_THREADS);
    for (int y = startRow; y <= startRow+(HEIGHT / MAX_THREADS); y++) {
        for (int x = 0; x <= WIDTH; x++) {
            double xx = (((double)x / (double)WIDTH) * 4.0) - 2.0;
            double yy = (((double)y / (double)HEIGHT) * 4.0) - 2.0;
            bitmap[x][y] = isPartOfSet(xx, yy) * 10;
        }
    }
    threadNum = startRow / (HEIGHT / MAX_THREADS);
    finishedThreads[threadNum] = true;
    cout << "Ending thread " << threadNum << " - Time: " << GetTickCount() - startTime << "ms" << endl;
    _endthread();
}


int main() {
    int startTime = GetTickCount();
    HANDLE hThread;
    HANDLE ghEvents[2];
    DWORD dwThreadID;
    int rowsPerThread = HEIGHT / MAX_THREADS;
    int arg;
    int threadIds[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i ++) {
        threadIds[i] = i;
        cout << "Starting thread " << i << endl;
        arg = i;
        _beginthread(renderRow, 0, &threadIds[i]);
        Sleep(10);
    }
    bool done = true;//Wait for all threads to finish
    while (1) {
        for (int i = 0; i < MAX_THREADS; i++){
            if (finishedThreads[i] == false)done = false;
        }
        if (done == true) break;
        else done = true;
        Sleep(20);
    }
    saveBitmap(WIDTH, HEIGHT);
    cout << endl << "Rendered in " << double(GetTickCount() - startTime) / 1000.0 << " seconds" << endl;
    cin.get();
    main();
}

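There is obviously more code than this, but I don't think it has any effect on the problem. What am I doing wrong here? I had the same problem on CUDA, so I believe the cause is the way I implemented the multithreading. Thanks.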

Answer 1

A classic example of incorrect concurrent use of a global variable.

bool finishedThreads[MAX_THREADS];

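is global and is accessed (written and read) from multiple threads without any synchronization. You cannot expect this to work. In your case you should not even use this variable. Instead, you should wait for the threads' finished events.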

Answer 2

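Hard-coding 8 threads is bad. What about a user with a dual-core laptop? Use std::thread::hardware_concurrency.

Sleeping is bad. Your spin loop is definitely not the right way to do this. Sorry, just being honest.

Use std::thread and use join to wait for the threads to finish. Better still: do all but one of the work items on other threads, do one on the main thread, then join the others. If there are N CPUs, you should create N-1 threads and do one item on the main thread.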
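As a rough sketch of that advice (not the asker's actual renderer): the names renderRows and renderParallel below are made up for illustration, and the per-pixel work is a placeholder standing in for isPartOfSet. It sizes the worker count from std::thread::hardware_concurrency, runs N-1 bands on new threads and one band on the calling thread, then joins.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-band work: fills rows [firstRow, lastRow) of a row-major image.
// The value x + y is a placeholder for the real Mandelbrot test.
void renderRows(std::vector<int>& image, int width, int firstRow, int lastRow) {
    for (int y = firstRow; y < lastRow; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = x + y;
}

// Hypothetical driver: one band per hardware thread, joined instead of polled.
std::vector<int> renderParallel(int width, int height) {
    std::vector<int> image(static_cast<std::size_t>(width) * height);

    // hardware_concurrency() may return 0 when the count is unknown.
    int workers = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    int rowsPerWorker = (height + workers - 1) / workers;  // round up

    std::vector<std::thread> threads;
    for (int i = 1; i < workers; ++i) {  // bands 1..N-1 run on new threads
        int first = std::min(height, i * rowsPerWorker);
        int last = std::min(height, first + rowsPerWorker);
        threads.emplace_back(renderRows, std::ref(image), width, first, last);
    }

    // Band 0 runs on the calling thread, so only N-1 threads are created.
    renderRows(image, width, 0, std::min(height, rowsPerWorker));

    for (auto& t : threads) t.join();  // blocks until each worker has finished
    return image;
}
```

Calling renderParallel(4000, 4000) would return only after every band has been written, so no finishedThreads array and no Sleep loop are needed. Each thread writes a disjoint range of rows, which is why the shared vector needs no lock here.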

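Why use a Windows-only API when there are better standard C++ library classes?

Suggested way to avoid `Sleep`

If simply waiting for a thread to exit (using join, as mentioned above) is not enough, as in a more complex scenario, then you should use std::mutex, std::unique_lock, and std::condition_variable.

You should have a variable that is set to true when the notification occurs. In the code that waits, you acquire the mutex and check that flag; if it is not set, you call wait on the condition variable.

In the thread that notifies the other thread, acquire the mutex, set the flag variable I mentioned, and call the notify_one or notify_all method on the condition variable.

Have a look at this reference on cppreference. The main ones you would use are those I already mentioned, though.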
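The wait/notify steps described above might look roughly like this. DoneSignal and runWorkerAndWait are names invented for this sketch, and the value 42 is just a placeholder result.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper bundling the flag, the mutex and the condition variable.
struct DoneSignal {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;  // the flag that is set to true when the notification occurs

    void notify() {
        {
            std::lock_guard<std::mutex> lock(m);  // acquire the mutex, set the flag
            done = true;
        }
        cv.notify_all();  // wake every waiter
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        // The predicate re-checks the flag, which guards against spurious wakeups
        // and against a notification that arrived before wait() was called.
        cv.wait(lock, [this] { return done; });
    }
};

// Hypothetical usage: the main thread blocks until the worker signals.
int runWorkerAndWait() {
    DoneSignal signal;
    int result = 0;

    std::thread worker([&] {
        result = 42;      // placeholder for real work
        signal.notify();  // publish completion
    });

    signal.wait();      // no Sleep, no polling loop
    int seen = result;  // safe to read: the write happened before notify()
    worker.join();
    return seen;
}
```

For the plain "wait until all threads are done" case in the question, join alone is enough; this pattern is for when a thread has to signal something while it keeps running.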

Answer 3

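In my answer I will not address the threading/synchronizing issues or the thoughts about caching; see the other answers/comments for that.

My point is a different one: you write "Every thread has the same thing to do, but with different numbers". If my memory of the Mandelbrot set serves me well, determining whether a point is a member of the set (in other words, the implementation of your isPartOfSet function, which you did not provide) is an iterative process. Some points "bail out" quickly, some do not, and for those you have to keep iterating until your predefined maximum number of iterations.

So what I am saying is: with your "one-large-block-per-thread" parallelization, it is probably natural that your threads take significantly different amounts of time.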
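To make that concrete, here is a small sketch that counts how many iterations each horizontal band needs. It assumes the usual escape-time iteration z = z*z + c with a bail-out radius of 2, since isPartOfSet was not shown; iterationsFor and iterationsPerBand are names made up for this example. The mapping of pixels to the plane ([-2, 2] on both axes) follows the question's code.

```cpp
#include <vector>

// Assumed escape-time test: number of iterations before |z| exceeds 2,
// capped at maxIterations. The asker's isPartOfSet may differ.
int iterationsFor(double cx, double cy, int maxIterations) {
    double zx = 0.0, zy = 0.0;
    int i = 0;
    while (i < maxIterations && zx * zx + zy * zy <= 4.0) {
        double next = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = next;
        ++i;
    }
    return i;
}

// Total iteration count for each of `bands` equal horizontal bands of the image.
std::vector<long long> iterationsPerBand(int width, int height, int bands,
                                         int maxIterations) {
    std::vector<long long> totals(bands, 0);
    int rowsPerBand = height / bands;
    for (int y = 0; y < rowsPerBand * bands; ++y) {
        double cy = (static_cast<double>(y) / height) * 4.0 - 2.0;
        for (int x = 0; x < width; ++x) {
            double cx = (static_cast<double>(x) / width) * 4.0 - 2.0;
            totals[y / rowsPerBand] += iterationsFor(cx, cy, maxIterations);
        }
    }
    return totals;
}
```

With 8 bands, the two middle bands cross the body of the set, where many points run to the iteration limit, while the outermost bands contain only points that bail out almost immediately. That is consistent with the timings in the question, where threads 3 and 4 take the longest and threads 0 and 7 finish first.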

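The solution to this kind of problem is to split the problem (i.e. the image) into smaller parts whose size does not depend on the number of threads. Instead, the size should be chosen empirically so that the parts are a) not so large that the work is distributed unevenly (as with the big blocks in your example) and b) not so small that they cause excessive organizational overhead.

So now you have M threads and N chunks of work (with N >> M), and you need an implementation that lets each thread work in a loop like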

while (worktodo) fetch_a_chunk_of_work_and_do_it ()
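One simple way to get such a loop, sketched here with a shared atomic counter rather than a full producer/consumer queue: each thread atomically claims the next chunk index until none are left. processInChunks and the doubling "work" are placeholders invented for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical driver: splits totalItems into chunks of chunkSize and lets
// workerCount threads (including the caller) pull chunks until none remain.
std::vector<int> processInChunks(int totalItems, int chunkSize, int workerCount) {
    std::vector<int> out(totalItems, 0);
    std::atomic<int> nextChunk{0};
    const int chunkCount = (totalItems + chunkSize - 1) / chunkSize;

    auto worker = [&] {
        for (;;) {
            int chunk = nextChunk.fetch_add(1);  // claim the next unclaimed chunk
            if (chunk >= chunkCount) break;      // no work left
            int begin = chunk * chunkSize;
            int end = std::min(totalItems, begin + chunkSize);
            for (int i = begin; i < end; ++i)
                out[i] = i * 2;  // placeholder for the real per-item work
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < workerCount - 1; ++i) threads.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& t : threads) t.join();
    return out;
}
```

A thread that happens to draw cheap chunks simply comes back for more, so the expensive regions end up spread across all threads. For the image in the question, a chunk could be a single row or a handful of rows; the right size is something to measure.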

How this kind of producer/consumer pattern is implemented, I will leave for others to describe (or for you to google :-))

c++

arrays

multithreading

mandelbrot

multidimensional-array
