为什么多线程不能提高这个寻找素数的程序的性能？

Question

运行时间差不多，不管线程数多少。我无法弄清楚原因。我知道线程是运行并行的，正如它们应该的那样，但我什至没有很好的猜测为什么没有性能改进。（对于单线程和多线程，找到所有小于 800 万的素数大约需要 21 秒）这是怎么回事？

typedef struct prime_finder_vars {
    long from;
    long to;
    int idx;
} PrimeFinderVars;

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return FALSE;
    }
    return TRUE;
}

void *prime_finder(void *pf) {

    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;

    long next_cand = pf_vars->from;
    while (next_cand < pf_vars->to) {
        if (is_prime(next_cand)) {
            ++counts[pf_vars->idx];
        }
        next_cand += 2;
    }
    return pf;
}


int main(void) {

    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);

    pthread_t threads[NUM_THREADS];
    PrimeFinderVars vars[NUM_THREADS];

    int slice_size = SEARCH_RANGE / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {

        vars[i].from = i * slice_size + 1;
        vars[i].to = (i + 1) * slice_size;
        vars[i].idx = i;

        pthread_create(&threads[i], NULL, prime_finder, &vars[i]);

    }

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);

    start_sec = start.tv_sec + start.tv_nsec / NANO_PER_SEC;
    end_sec = end.tv_sec + end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
}

Answer 1

线程之间的工作负载没有很好地平衡。每个线程必须检查大约相同数量的候选者，但是对于具有较高索引的线程，检查每个候选者比具有较低索引的线程需要更多时间。

我会像这样重写主循环：

for (long candidate = pf_vars->idx;
     candidate < SEARCH_RANGE;
     candidate += NUM_THREADS) {
    if (is_prime (candidate)) {
        ++counts [pf_vars->idx];
    }
}

NUM_THREADS 必须是质数本身才能有效工作。

此外，我怀疑您的代码是否会产生正确的结果，因为如果 pf_vars->from 是偶数，prime_finder 将只检查没有多大意义的偶数候选。

此外，线程运行只有当它们运行在不同的内核上时才会并行。如果线程数远大于核心数，那么性能会下降，因为在多个线程之间切换一个核心也需要一些时间。

Answer 2

这是一个有趣的问题。 Mikhail Vladimirov 所说的一切都是真的，但我决定在我的笔记本电脑上做一些测试，看看我得到了什么。我的笔记本电脑是配备八核 i9 的现代 MacBook pro。我不确定它是否是超线程的，但这是我的结果：

我测试了 1 到 50 之间的线程数和 10,000,000 的搜索范围。

使用一个线程需要将近 11 秒，但是使用 16 个线程这个时间会迅速下降到大约 1.5 秒，并且之后也没有任何改善。

我的结论是

我对 Mikhail 关于线程函数成本的回答的评论是错误的，至少在我的平台上是这样。我发现更多线程没有增加开销
你的线程库有问题。

我认为您可能需要让自己满意，线程确实运行在不同的内核上并行运行。对您的结果的一种解释可能是它们都在争夺相同的 CPU.

为了好玩，我决定尝试分析程序。

每一步都代表另一个核心达到 100%。我不确定为什么具有三个线程的 prt 没有达到 300%，但是您可以看到具有四个线程的 prt 会立即上升到 400%，但会以 100% 的步长下降。这是您将任务分成相等范围并且处理较低数字的线程更快完成的效果。

前16个数据点

Threads Time
1   11.893418
2   7.352520
3   5.117278
4   4.062026
5   3.511605
6   2.892274
7   2.401555
8   2.172573
9   1.910534
10  1.864023
11  1.860944
12  1.369277
13  1.628883
14  1.196646
15  1.626215
16  1.548878

我用来生成测试结果的代码（根据您的代码稍作修改）。

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <stdbool.h>

#define SEARCH_RANGE    10000000
#define NANO_PER_SEC    1000000000

typedef struct prime_finder_vars {
    long from;
    long to;
    int* count;
} PrimeFinderVars;

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return false;
    }
    return true;
}


void *prime_finder(void *pf)
{

    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;

    long next_cand = pf_vars->from;
    while (next_cand < pf_vars->to)
    {
        if (is_prime(next_cand))
        {
            (*pf_vars->count)++ ;
        }
        next_cand += 2;
    }
    return pf;
}


void trial(int numThreads)
{
    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);

    int counts[numThreads];
    pthread_t threads[numThreads];
    PrimeFinderVars vars[numThreads];

    int slice_size = SEARCH_RANGE / numThreads;

    for (int i = 0; i < numThreads; i++)
    {
        counts[i] = 0;
        vars[i].from = i * slice_size + 1;
        vars[i].to = (i + 1) * slice_size;
        vars[i].count = &counts[i];

        pthread_create(&threads[i], NULL, prime_finder, &vars[i]);

    }

    for (int i = 0; i < numThreads; i++)
    {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);

    start_sec = (double)start.tv_sec + (double)start.tv_nsec / NANO_PER_SEC;
    end_sec = (double)end.tv_sec + (double)end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
    printf("%d\t%f\n", numThreads, elapsed_sec);
}

int main()
{
    printf("Threads\tTime\n");
    for (int threads = 1 ; threads <= 50 ; ++threads)
    {
        trial(threads);
    }
}

我在过去一两天进一步研究了这个问题。首先，我很好奇为什么似乎有一条双线计时：在大约 12 个线程之后，运行将花费 1.5 秒或 1 秒。我在上面推测这是因为 Mikhail 提到的错误，所以我绘制了每个线程数的实际答案并发现，虽然答案通常在 664,579 左右，但通常会是这个数字的一半左右，而且毫不奇怪，当答案是一半时真正的答案，对应于两条时间线中较低的一条。

所以我修复了那个错误，双线效果消失了。但是，根据线程的数量，我仍然得到不止一个不同的答案。

之所以这样，是因为还有两个bug。

原始算法无法测试每个范围内的顶部数字。
范围的大小是通过将搜索范围除以线程数来计算的。除非没有余数，否则不会检查搜索范围顶部的数字。

我修复了两个错误并修复了第三个运行。这并没有明显影响时间，但我对使用的每个线程数都得到了相同的答案。

为了比较，我写了一个 Eratosthenes 筛法，并对其进行了计时。使用它和单个线程仅需 0.2 秒 - 大约比最快的内核数快七倍。

我发布了 spreadsheet of the results and there's a git repo of the code。

Answer 3

在您确定 2 和 3 为质数后，所有剩余的质数都采用 6N±1 的形式，因为 N >= 1。为了平衡 K 个线程之间的工作负载，您应该让 K 个线程中的每一个逐步执行它自己的 N 值序列：线程 T 在 1 + T + K * X 上工作，其中对于每个线程，X 序列从 0 向上。如果你有 8 个线程，这意味着线程 0 在 N₀ = { 1, 9, 17, … } 等上工作。这仍然意味着线程 K-1 比线程 0 做更多的工作，因为它正在处理更大的数字，但差异是比水平切片范围时少得多。这意味着您需要为每个线程提供起始编号 S 和线程总数 K，然后该线程会将计数器 x 设置为值 0、1... 并检查素数 6(S + xK)±1.

现在，以此作为创建 multi-threaded 素数查找器的良好基础。这与您的代码密切相关。它确实使用了我的 SOQ 存储库的 SOQ (Stack Overflow Questions) repository on GitHub as files timer.c, timer.h, stderr.c and stderr.h in the src/libsoq sub-directory. It uses function isprime() renamed from IsPrime3B() found in the file isprime.c in the src/Primes sub-directory 中可用的一些代码。该程序 prime-thread.c 来自同一 src/Primes 目录。

/* SO 6438-1942 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "stderr.h"
#include "timer.h"

#define NANO_PER_SEC 1.0E9
enum { NUM_THREADS = 8 };
enum { MAX_NUMBER  = 10000000 };

static size_t counts[NUM_THREADS];

static const unsigned int small_primes[] =
{
     5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
    53, 59, 61, 67, 71, 73, 79, 83, 87, 89, 91, 97
};
enum { NUM_SMALL_PRIMES = sizeof(small_primes) / sizeof(small_primes[0]) };

/* IsPrime3B() from isprime.c - renamed is_prime() */
static int is_prime(unsigned number)
{
    if (number <= 1)
        return 0;
    if (number == 2 || number == 3)
        return 1;
    if (number % 2 == 0 || number % 3 == 0)
        return 0;
    for (unsigned i = 0; i < NUM_SMALL_PRIMES; i++)
    {
        if (number == small_primes[i])
            return 1;
        if (number % small_primes[i] == 0)
            return 0;
    }
    /* After 97, the next prime numbers are 101, 103, 107, 109 */
    /*
    ** It would be feasible to start this loop from:
    ** i = (((small_primes[NUM_SMALL_PRIMES - 1] + 1) / 6) + 1) * 6
    */
    for (unsigned i = 102; (i - 1) <= number / (i - 1); i += 6)
    {
        if (number % (i - 1) == 0 || number % (i + 1) == 0)
            return 0;
    }
    return 1;
}

typedef struct prime_finder_vars
{
    unsigned from;
    unsigned to;
    unsigned increment;   /* Number of threads */
    unsigned idx;
} PrimeFinderVars;

static void *prime_finder(void *pf)
{
    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;
    printf("Thread %u: from = %u, to = %u, inc = %u\n",
           pf_vars->idx, pf_vars->from, pf_vars->to, pf_vars->increment);

    unsigned next = pf_vars->from;
    while (next < pf_vars->to) {
        unsigned six_n = 6 * next;
        if (is_prime(six_n - 1))
            ++counts[pf_vars->idx];
        if (is_prime(six_n + 1))
            ++counts[pf_vars->idx];
        next += pf_vars->increment;
    }
    printf("Thread %u: done\n", pf_vars->idx);
    return pf;
}

int main(int argc, char **argv)
{
    err_setarg0(argv[0]);
    if (argc != 1)
        err_usage("");
    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;
    Clock clk;

    clk_init(&clk);

    clk_start(&clk);
    clock_gettime(CLOCK_REALTIME, &start);

    pthread_t threads[NUM_THREADS];
    PrimeFinderVars vars[NUM_THREADS];

    int max_n = (MAX_NUMBER + 5) / 6;

    for (int i = 0; i < NUM_THREADS; i++)
    {
        vars[i].from = i + 1;
        vars[i].to = max_n;
        vars[i].idx = i;
        vars[i].increment = NUM_THREADS;

        int rc;
        if ((rc = pthread_create(&threads[i], NULL, prime_finder, &vars[i])) != 0)
            err_syserr("failed to create thread %d: ", i);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);
    clk_stop(&clk);

    start_sec = start.tv_sec + start.tv_nsec / NANO_PER_SEC;
    end_sec = end.tv_sec + end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
    printf("Time 1: %.6f\n", elapsed_sec);
    char buffer[32];
    printf("Time 2: %s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));

    /* Because 2 and 3 are primes but are not analyzed */
    size_t t_count = 2;
    for (int i = 0; i < NUM_THREADS; i++)
    {
        t_count += counts[i];
        printf("%d: %7zu primes found\n", i, counts[i]);
    }
    printf("Total primes found up to %d = %zu\n", MAX_NUMBER, t_count);

    return 0;
}

示例输出：

$ timecmd -u -- prime-thread
2020-10-16 12:15:05.101785 [PID 75174] prime-thread
Thread 0: from = 1, to = 1666667, inc = 8
Thread 7: from = 8, to = 1666667, inc = 8
Thread 2: from = 3, to = 1666667, inc = 8
Thread 3: from = 4, to = 1666667, inc = 8
Thread 5: from = 6, to = 1666667, inc = 8
Thread 4: from = 5, to = 1666667, inc = 8
Thread 6: from = 7, to = 1666667, inc = 8
Thread 1: from = 2, to = 1666667, inc = 8
Thread 0: done
Thread 6: done
Thread 4: done
Thread 7: done
Thread 3: done
Thread 5: done
Thread 2: done
Thread 1: done
Time 1: 0.231135
Time 2: 0.231135
0:   83090 primes found
1:   83176 primes found
2:   83023 primes found
3:   82996 primes found
4:   83060 primes found
5:   82995 primes found
6:   83179 primes found
7:   83058 primes found
Total primes found up to 10000000 = 664579
2020-10-16 12:15:05.341489 [PID 75174; status 0x0000]  -  0.239704s
$

确实有664,579个小于10,000,000的素数。

请注意，timecmd 程序计算了 prime-thread 的整个运行时间（start-up 和打印），而内部计时仅计算线程创建，运行，以及终止时间。这说明了 8 毫秒的时间差异。（这是我用于计时命令的 home-brew 程序。它与 system-provided time 命令大致相似——但也有很大不同。）

给定一个最多 10,000,000 个素数的列表，计算每个线程应该找到多少个素数是可行的。不过，鉴于总数是正确的，因此不太可能存在问题。

时机

请注意，题目说计算素数达到 8,000,000 需要 21 秒。这段代码用了 0.231 秒来计算素数直到 10,000,000。这表明正在使用的 isprime() 功能不如我使用的功能。

确实，显示的代码是：

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return FALSE;
    }
    return TRUE;
}

这做的工作比必要的多得多。只有一个偶素数 2。此代码检查 4、6、8，...，这些都是平凡的 non-prime。这是必要工作量的两倍。检查 2 然后只检查奇数将是一个显着的改进。检查 2、3，然后匹配 6N±1 的数字给出另一个改进。

即使如此，检查三分之一的数据也只会改善 3 倍。不平衡的工作负载可能是一个更大的因素。在 8 个线程 (0..7) 中，线程 7 在 7,000,000..8,000,000 范围内工作，它比线程 0 在 0..1,000,000 范围内工作的计算量要多得多，尽管它要计算的素数更少计数。

问题没有显示完整的 MCVE（Minimal, Complete, Verifiable Example — 或 MRE 或 SO 现在使用的任何名称）或一个 SSCCE（Short, Self-Contained, Correct Example）。它没有显示正在使用的线程数。

我没有，但也许应该，参数化 prime-thread.c 以采用可变数量的线程和可变范围进行分析（并删除线程调试打印）并查看它改变了多少行为。在我的机器上，更多线程不太可能改善情况；可能线程越少越好。

我有一个程序 primes，它使用埃拉托色尼筛法打印给定范围内的素数。当输出到 (SSD) 文件或 /dev/null 时，它 在大约 0.135 秒内打印 所有 664,579 个质数高达 1000 万。这比 prime-thread 计算素数要快得多。更好的算法带来了很多好处。这个 isprime() 函数对每个候选号码进行大量计算。

从中吸取的两个教训：

算法很重要。
线程并不是加快速度的灵丹妙药。

为什么多线程不能提高这个寻找素数的程序的性能？

Why doesn't multithreading improve performance in this program for finding primes?

c

multithreading

pthreads

pthread-join

时机