为什么使用 OpenMP 的多线程嵌套 for 循环速度不快？

Question

我用以下 5 个参数编写了一个函数：

文本行数组；
一些单词的数组；
行数（约40k）；
字数（约4-5）；
要创建的线程数（k，尝试了很多值使得 2

该函数在 for 循环中搜索给定的单词，在每次迭代中扫描每一行以查找一个单词（外部 for）。

内部 for 循环：

遍历行；
检查该行是否包含给定的单词；
计算单词在行中的出现次数；
如果在行中找到该词，则打印一条消息；
增加单词在整个文本中的总出现次数及其在行中的出现次数。

我先把它写成一个串行程序，然后尝试使用 OpenMP 将它并行化，并比较每个程序分析整个文本所花费的时间。不幸的是，与串行版本相比，我无法改进并行程序的时间。

为了测量时间，我使用了以下函数：

double ClockCounter;

void start()
{
        ClockCounter = omp_get_wtime();
}

void stop()
{
        ClockCounter = omp_get_wtime() - ClockCounter;
        printf("Elapsed time: %f\n", ClockCounter);
}

以及文本分析功能：

void analyze_text(char **lines, char *words[], size_t nr_lines, int nr_words, int k) {
        int i, total_word_count, word_occurence;
        start();
#       pragma omp parallel for num_threads(k) \
        default(none) shared(total_word_count, word_occurence) private(i, j, nr_words, nr_lines, word_occurence, counter)
        for (i=0; i<nr_words; i++) {
                total_word_count = 0;
                size_t j;
                printf("The word is: %s.\nSEARCHING...\n", words[i]);
//#             pragma omp parallel for num_threads(k) \
//              default(none) shared(total_word_count) private(j, nr_lines, word_occurence, counter)
                for (j=0; j<nr_lines; j++) {
                        char *line = NULL;
                        line = (char *) malloc (strlen(lines[j]) + 1);
                        strcpy(line, lines[j]);
                        word_occurence = 0;
                        if (strstr(line, words[i]) != NULL) {
                                printf("Word found at line %ld, position(s): ", j);
                                char *found_word = NULL;
                                found_word = (char *) malloc (strlen(words[i]) +1);
                                char delim[] = " -";
                                char *ptr = strtok(line, delim);
                                int counter =  0;
                                while (ptr != NULL) {
                                        counter++;
                                        strncpy(found_word, ptr, strlen(words[i]));
                                        found_word[strlen(found_word)] = '[=11=]';
                                        if (strcmp(words[i], found_word) == 0) {
                                                word_occurence++;
                                                printf("%d ", counter);
                                        }
                                        ptr =  strtok(NULL, delim);

                                }
                                printf("\nline[%3zu]: %s\n", j, lines[j]);
                        }
                        total_word_count += word_occurence;
                }
                printf("The word appears %d times in the text in total.\n\n", total_word_count);
        }
        stop();
}

如您所见，我也只尝试并行化内部 for 循环，但我也没有得到任何改进。由于嵌套的 for 循环有点混乱，我不确定在哪里以及如何声明并行 for 指令，也不知道如何正确地将变量设置为共享或私有。

顺序版本看起来完全一样，但没有声明 omp 指令。

顺序版本的时间（~=）：

Elapsed time: 0.025583

并行版本的时间（~=）（k=2、4、6、8，始终相同）：

Elapsed time: 0.025690

有人可以向我解释一下我在这里缺少什么吗？

Answer 1

TL;DR 回答：您的代码受内存限制（大量内存读取或写入，实际上没有 CPU 密集计算）。 CPU 很容易超过可用内存带宽，增加线程数不会提高性能。这可能会发生在您的硬件上，而且令人惊讶的是，这也会发生在 Compiler Explorer 上，但不会发生在我的计算机上。我真的很纳闷

详细答案：

您的代码存在一些问题（数据竞争、内存泄漏、变量在数据子句中多次出现等），因此我执行了以下操作来修复这些错误并使您的代码更快：

在所需的最小范围内使用了变量。它有助于避免数据竞争并帮助编译器进行更好的优化。
更正了 OpenMP 指令。在 i 循环之前，您必须添加以下行：

#pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines)

另一种方法是并行化 j 循环。这里为了避免竞争条件，你必须使用 reduction (reduction(+:total_word_count)):

#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)

去掉了耗时的malloc，取而代之的是静态分配：

//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000

...
char line[MAX_LINE_LEN]; 
....
char found_word[MAX_LINE_LEN]; //it may be smaller

strcpy(line, lines[j]); 被移动到随后的 if 语句之后。这非常耗时，并且只有在找到单词时才需要。所以

char *line = NULL;
line = (char *) malloc (strlen(lines[j]) + 1);
strcpy(line, lines[j]);
word_occurence = 0;
if (strstr(line, words[i]) != NULL) {

已更改为

int word_occurence = 0;
if (strstr(lines[j], words[i]) != NULL) {
  char line[MAX_LINE_LEN];                        
  strcpy(line, lines[j]);
  ...
  //Note that the code below is not critical (rarely executed)...

将所有 printf 功能更改为

#ifdef USE_PRINTF
 printf(...);
#endif

因此很容易打开 on/off 打印输出并且更容易测试其性能。请注意，如果使用更多线程，打印输出将混合，因此您必须先收集输出并稍后打印。

请注意，重写代码并首先遍历文本行并在一行中搜索每个单词会好得多。在这种情况下，缓存利用率会更好，从而提高性能。
添加了一个主要功能来创建文本（通过从 5 条预定义行中随机添加行）来测试其性能

代码如下：

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>

//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000

//Uncomment this line to see printout
//#define USE_PRINTF

double ClockCounter;
void start()
{
        ClockCounter = omp_get_wtime();
}

void stop()
{
        ClockCounter = omp_get_wtime() - ClockCounter;
        printf("Elapsed time: %f\n", ClockCounter);
}


void analyze_text(char **lines, const char *words[], const size_t nr_lines, const int nr_words, const int k) {
        start();
        #pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines) 
        for (int i=0; i<nr_words; i++) {
                int total_word_count = 0;
                #ifdef USE_PRINTF
                printf("The word is: %s.\nSEARCHING...\n", words[i]);
                #endif
                //#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)
                for (size_t j=0; j<nr_lines; j++) {
                        int word_occurence = 0;
                        if (strstr(lines[j], words[i]) != NULL) {
                                char line[MAX_LINE_LEN];                        
                                strcpy(line, lines[j]);
                                #ifdef USE_PRINTF
                                  printf("Word found at line %ld, position(s): ", j);
                                #endif
                                char found_word[MAX_LINE_LEN]; 
                                char delim[] = " -";
                                char *ptr = strtok(line, delim);
                                int counter =  0;
                                while (ptr != NULL) {
                                        counter++;
                                        strncpy(found_word, ptr, strlen(words[i]));
                                        found_word[strlen(found_word)] = '[=16=]';
                                        if (strcmp(words[i], found_word) == 0) {
                                                word_occurence++;
                                                #ifdef USE_PRINTF
                                                printf("%d ", counter);
                                                #endif
                                        }
                                        ptr =  strtok(NULL, delim);

                                }
                                #ifdef USE_PRINTF
                                printf("\nline[%3zu]: %s\n", j, lines[j]);
                                #endif
                        }
                        total_word_count += word_occurence;
                }
                #ifdef USE_PRINTF
                printf("The word appears %d times in the text in total.\n\n", total_word_count);
                #endif
        }
        stop();
}

int main(){    
    const size_t nr_lines=40000;
    char* lines[nr_lines];
    //sample lines, the text have these lines in a random order
    const char* samples[]={
        "xxx yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "one yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx two zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy three qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy zzz four sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
        "xxx yyy zzz qqq five dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt"
    };


    for(size_t i=0;i<nr_lines;++i){
        unsigned long int rnd=rand()%1000;
        rnd = rnd >= sizeof(samples)/sizeof(samples[0]) ? 0 : rnd;
        lines[i]=(char*) malloc(strlen(samples[rnd])+1);
        strcpy(lines[i],samples[rnd]);

    }
    
    printf("nr_lines=%ld\n",nr_lines);

    const char* words[]={"one", "two", "three", "four","five"};
    const size_t nr_words=sizeof(words)/sizeof(words[0]);

    printf("nr_words=%ld\n",nr_words);

    for(int i=1;i<6;i++)
    {
        printf("Threads used=%d  --- ",i);
        analyze_text(lines, words, nr_lines, nr_words, i);
    }
  // You should free allocated memory here.
  return 0;
}

我测试了代码，首先使用 400000 行并行化了第一个 (i) 循环（注意 40000 太快了，所以我使用的行数比你多 10 倍）和 [=29 上的 5 个字=]，结果是：

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032551
Threads used=2  --- Elapsed time: 0.028461
Threads used=3  --- Elapsed time: 0.029304
Threads used=4  --- Elapsed time: 0.027002
Threads used=5  --- Elapsed time: 0.026587

在我的笔记本电脑上，使用核心进行缩放要好得多：

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.046173
Threads used=2  --- Elapsed time: 0.028958
Threads used=3  --- Elapsed time: 0.019951
Threads used=4  --- Elapsed time: 0.015462
Threads used=5  --- Elapsed time: 0.020994

我还并行化了第二个循环（在这种情况下，第一个循环没有并行化以避免嵌套并行）。编译器资源管理器的结果：

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032781
Threads used=2  --- Elapsed time: 0.026288
Threads used=3  --- Elapsed time: 0.026601
Threads used=4  --- Elapsed time: 0.027013
Threads used=5  --- Elapsed time: 0.027567

在我的笔记本电脑上：

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.047092
Threads used=2  --- Elapsed time: 0.024240
Threads used=3  --- Elapsed time: 0.016849
Threads used=4  --- Elapsed time: 0.014475
Threads used=5  --- Elapsed time: 0.012770
Threads used=6  --- Elapsed time: 0.011703
Threads used=7  --- Elapsed time: 0.009543
Threads used=8  --- Elapsed time: 0.012953

你可以看到

在 Compiler Explorer 上获得的结果令人失望。我根本不知道它的原因是什么。它很可能与内存带宽/内存访问的可用通道有关，但我认为高端处理器用于托管 Compiler Explorer。
在我的笔记本电脑上，缩放比例要好得多，特别是如果第二个循环是并行的。这并不奇怪，因为在这种情况下缓存利用率要好得多。因此，我建议您重写您的代码，首先遍历行并在一行中搜索所有单词。

最后说明：如果性能很重要，您很可能可以找到一个并行库，它旨在尽可能高效地进行此类搜索。

为什么使用 OpenMP 的多线程嵌套 for 循环速度不快？

Why is nested for loop not faster with multiple threads using OpenMP?

c

parallel-processing

performance

multithreading

openmp