OpenMP gets significantly slower when Array reaches certain size
I am trying to solve a homework example in parallel computing. Here is the given code snippet:
for (int i=0; i < n-1; i++) {
    x[i] = (y[i] + x[i+1]) / 7;
}
We have to analyze the snippet with regard to its dependencies, try to parallelize it, and benchmark it with a suitable sample size. My first attempt was simply to copy x and thereby remove the anti-dependency:
double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to avoid the i + 1 in the loop we start at x[1] and copy n - 1 elements

#pragma omp parallel for
for (int i = 0; i < n-1; i++) {
    x[i] = (y[i] + x_old[i]) / 7;
}
This code works; I compared the serial and parallel checksums of the array and they are identical. Not counting the memcpy(), the speedup is between 1.6 and 2 for 100_000_000 - 400_000_000 array elements. However, if I increase the array size further (around 470_000_000 elements), the speedup drops significantly, and at 500_000_000 array elements it is only 0.375. This is independent of how many threads I use.
I benchmarked it on 3 different devices, with the same results (in terms of speedup):
- a Lenovo IdeaPad 5
- a desktop PC
- one of our cluster nodes

Why does the speedup drop, and why does it drop so quickly? Why does it happen at that particular array size?
I have included a minimal example of the program here:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <omp.h>

void init(double *arr, int size);
double sum(double *arr, int size);

int main(int argc, char **argv){
    if(argc != 3){
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);
    omp_set_num_threads(threads);

    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    init(x, n);
    init(y, n);

    double start = omp_get_wtime();
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x[i+1]) / 7;
    }
    double end = omp_get_wtime();
    double elapsed_s = end - start;
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);
    printf("single checksum: %f\n", checksum_s);

    init(x, n); // reinit
    init(y, n); // reinit
    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to avoid the i + 1 in the loop we start at x[1] and copy n - 1 elements

    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x_old[i]) / 7;
    }
    end = omp_get_wtime();
    double elapsed_p = end - start;
    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);
    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);
    printf("%s\n", fabs(checksum_p - checksum_s) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
}

void init(double *arr, int size){
    for(int i = 0; i < size; ++i){
        arr[i] = 1.0;
    }
}

double sum(double *arr, int size){
    double sum = 0.0;
    for(int i = 0; i < size; ++i){
        sum += arr[i];
    }
    return sum;
}
Here is a sample benchmark:
The problem most likely comes from swap memory. Indeed, the code allocates 3 arrays of 500_000_000 items, resulting in 4 GiB of RAM per array and 12 GiB in total. The thing is, two of the benchmarked machines have only 16 GiB of RAM available, and the OS plus the running applications typically already need 1-4 GiB of RAM. When the amount of available physical memory becomes small, the OS starts to move memory pages to a space called swap memory, which is generally mapped on a storage device that is far slower than regular RAM. Note that some operating systems (like Windows) also use ZRAM-like compression so as to compress data when the amount of available memory gets small. Mainstream OSs (Windows, Linux and macOS) additionally need some space for buffers (e.g. storage-device buffers, networking, etc.). Allocating memory pages also becomes slower when the amount of available memory gets small. Also note that, on mainstream systems, pages are only allocated in RAM at first touch. For more information about this, read this article.
Allocating the huge x_old array is not efficient (neither in time nor in space). Indeed, the memcpy takes a lot of time to copy the data in memory, and this operation is bound by the throughput of the RAM, which is generally low. The operation can be made faster by working on chunks. The idea is to copy a small data block (say 16 KiB) in each thread so that the operation can be done in the L1 cache of each core. This means the cost of the memcpy nearly vanishes, since CPU caches are much faster than the RAM (both in latency and in throughput). In fact, the L1 CPU cache of a notebook/desktop computer should deliver >100 GiB/s per core, resulting in a cumulative throughput of >600 GiB/s, while the throughput of the RAM should not exceed 40 GiB/s in practice (15x slower).
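To make the chunking idea concrete, here is a minimal sketch under the assumptions above (one 16 KiB block of doubles per chunk; compute_chunked, CHUNK and border are illustrative names, not from the original code). The only value a chunk reads that a neighboring chunk writes is its last x[i + 1], so those boundary elements are saved up front:

#include <stdlib.h>
#include <string.h>

#define CHUNK 2048 /* 2048 doubles = 16 KiB, sized to fit in a per-core L1 cache */

void compute_chunked(double *x, const double *y, int n)
{
    int nchunks = (n - 1 + CHUNK - 1) / CHUNK;
    double *border = malloc(nchunks * sizeof(*border));

    /* The last iteration i of chunk c reads x[i + 1], which is the first
       element written by chunk c + 1, so save those values beforehand. */
    for (int c = 0; c < nchunks; c++) {
        int last = (c + 1) * CHUNK;
        if (last > n - 1)
            last = n - 1;
        border[c] = x[last];
    }

    #pragma omp parallel for schedule(static)
    for (int c = 0; c < nchunks; c++) {
        double buf[CHUNK]; /* small per-thread copy that stays in L1 */
        int base = c * CHUNK;
        int len = n - 1 - base;
        if (len > CHUNK)
            len = CHUNK;
        /* Copy the block before overwriting it; all other values this
           chunk reads are written only by the chunk itself. */
        memcpy(buf, &x[base + 1], (len - 1) * sizeof(*buf));
        buf[len - 1] = border[c];
        for (int i = 0; i < len; i++)
            x[base + i] = (y[base + i] + buf[i]) / 7;
    }

    free(border);
}

This keeps the extra memory traffic down to one small border array (one double per chunk) instead of a full x_old copy, so the copy cost should mostly disappear as described.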
For the cluster node, it is a bit more complex since it is likely subject to NUMA effects. NUMA architectures are pretty complex. For more information, I advise you to read this article first, then the related answers on the topic (including their linked material). In short, you need to care about where pages are allocated (the allocation policy). The key is to control the first touch of the pages (on most platforms).
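One common way to control the first touch is to initialize the arrays in parallel with the same static schedule that the compute loop will later use; a minimal sketch under that assumption (init_first_touch is an illustrative name):

void init_first_touch(double *arr, int size)
{
    /* On most platforms a page is mapped on the NUMA node of the thread
       that first writes it, so touching the pages with the compute loop's
       schedule keeps each thread's data on its local memory node. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < size; i++)
        arr[i] = 1.0;
}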
A few issues ...
- Using x_old in the parallel version means using 3 arrays (i.e. x, y, x_old) instead of 2 (i.e. x, y), which puts extra pressure on the memory controller and caches, so the comparison is not valid. The parallel version is at a distinct disadvantage.
- You only showed data for a single thread and for 8 threads. Your program is memory bus [and cache] bound, rather than computation bound.
- From Atomically increment two integers with CAS, follow the video link to a talk: https://www.youtube.com/watch?v=1obZeHnAwz4&t=1251 The speaker benchmarks similar things there. His conclusion was that any more than 4 threads floods the memory bus (i.e. actually reduces performance). The extra threads just add overhead from thread switching, cache misses, and reduced locality of reference.
- Other things on the system can affect the numbers. Other [unrelated] threads/processes can steal CPU time and memory, and the kernel adds overhead for handling devices and interrupts. To minimize this effect, each test should be run multiple times (e.g. 10), and we should use the minimum time for each setting.
- For tests of short duration (e.g. less than a second), the overhead of starting/stopping threads can be a factor.
For example, I just ran your program with: 1000000 8 and got:
single elapsed: 0.008146s
single checksum: 285715.000000
omp threads 8 Thread(s) elapsed: 0.008198s
omp threads 8 Thread(s) single checksum: 285715.000000
OK
Speedup: 0.994
When I ran it with: 1000000 4, I got:
single elapsed: 0.008502s
single checksum: 285715.000000
omp threads 4 Thread(s) elapsed: 0.002677s
omp threads 4 Thread(s) single checksum: 285715.000000
OK
Speedup: 3.176
When I ran it one more time with: 1000000 8, I got:
single elapsed: 0.008450s
single checksum: 285715.000000
omp threads 8 Thread(s) elapsed: 0.005630s
omp threads 8 Thread(s) single checksum: 285715.000000
OK
Speedup: 1.501
So, 4 was faster than 8. And, the two runs with 8 produced [wildly] different results.
UPDATE:

I refactored the program to run the multiple tests as described above:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <omp.h>

void init(double *arr, int size);
double sum(double *arr, int size);

double *x;
double *y;

typedef void (*fnc_p)(double *x, double *y, int n);

void
dosingle(double *x, double *y, int n)
{
    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x[i + 1]) / 7;
    }
}

void
doparallel(double *x, double *y, int n)
{
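    // Note: this parallelizes the original loop as-is, so the x[i + 1] read
    // at each thread's block boundary races with the neighboring thread's
    // write; the slightly different checksums for each thread count in the
    // output below are a symptom of that race.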
    #pragma omp parallel for
    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x[i + 1]) / 7;
    }
}

double
timeone(double *x, double *y, int n, fnc_p fnc)
{
    init(x, n);
    init(y, n);

    double start = omp_get_wtime();
    fnc(x, y, n);
    double end = omp_get_wtime();

    double elapsed = end - start;
    printf(" elapsed: %fs", elapsed);

    double checksum = sum(x, n);
    printf(" checksum: %f\n", checksum);

    return elapsed;
}

double
timeall(int threads, double *x, double *y, int n, fnc_p fnc, const char *who)
{
    omp_set_num_threads(threads);
    printf("\n");
    printf("threads: %d\n", threads);

    double best = 1e9;
    for (int irun = 1; irun <= 10; ++irun) {
        printf("%s/%d:", who, irun);
        double elapsed = timeone(x, y, n, fnc);
        if (elapsed < best)
            best = elapsed;
    }

    return best;
}

int
main(int argc, char **argv)
{
    if (argc != 3) {
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);

    x = malloc(sizeof(*x) * n);
    y = malloc(sizeof(*y) * n);

    double elapsed_s = timeall(1, x, y, n, dosingle, "single");

    double best_elapsed = 1e9;
    int best_nthread = 0;
    for (int ithr = 2; ithr <= threads; ++ithr) {
        double elapsed_p = timeall(ithr, x, y, n, doparallel, "parallel");
        printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
        if (elapsed_p < best_elapsed) {
            best_elapsed = elapsed_p;
            best_nthread = ithr;
        }
    }

    printf("\n");
    printf("FINAL %d is best with %fs and speedup %.3f\n",
        best_nthread, best_elapsed, elapsed_s / best_elapsed);

#if 0
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);
    printf("single checksum: %f\n", checksum_s);

    init(x, n); // reinit
    init(y, n); // reinit
    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to avoid the i + 1 in the loop we start at x[1] and copy n - 1 elements

    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n - 1; i++) {
        x[i] = (y[i] + x_old[i]) / 7;
    }
    end = omp_get_wtime();

    double elapsed_p = end - start;
    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);
    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);
    printf("%s\n", fabs(checksum_p - checksum_s) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
#endif

    return 0;
}

void
init(double *arr, int size)
{
    for (int i = 0; i < size; ++i) {
        arr[i] = 1.0;
    }
}

double
sum(double *arr, int size)
{
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
        sum += arr[i];
    }
    return sum;
}
Here is the test output for: 1000000 16:
threads: 1
single/1: elapsed: 0.008467s checksum: 285715.000000
single/2: elapsed: 0.008178s checksum: 285715.000000
single/3: elapsed: 0.008460s checksum: 285715.000000
single/4: elapsed: 0.008126s checksum: 285715.000000
single/5: elapsed: 0.008250s checksum: 285715.000000
single/6: elapsed: 0.008206s checksum: 285715.000000
single/7: elapsed: 0.007933s checksum: 285715.000000
single/8: elapsed: 0.008440s checksum: 285715.000000
single/9: elapsed: 0.008267s checksum: 285715.000000
single/10: elapsed: 0.008671s checksum: 285715.000000
threads: 2
parallel/1: elapsed: 0.004318s checksum: 285714.897959
parallel/2: elapsed: 0.004216s checksum: 285714.897959
parallel/3: elapsed: 0.004224s checksum: 285714.897959
parallel/4: elapsed: 0.004223s checksum: 285714.897959
parallel/5: elapsed: 0.004201s checksum: 285714.897959
parallel/6: elapsed: 0.004235s checksum: 285714.897959
parallel/7: elapsed: 0.004276s checksum: 285714.897959
parallel/8: elapsed: 0.004161s checksum: 285714.897959
parallel/9: elapsed: 0.004163s checksum: 285714.897959
parallel/10: elapsed: 0.004188s checksum: 285714.897959
Speedup: 1.906
threads: 3
parallel/1: elapsed: 0.002930s checksum: 285714.795918
parallel/2: elapsed: 0.002824s checksum: 285714.795918
parallel/3: elapsed: 0.002788s checksum: 285714.795918
parallel/4: elapsed: 0.002855s checksum: 285714.795918
parallel/5: elapsed: 0.002816s checksum: 285714.795918
parallel/6: elapsed: 0.002800s checksum: 285714.795918
parallel/7: elapsed: 0.002854s checksum: 285714.795918
parallel/8: elapsed: 0.002794s checksum: 285714.795918
parallel/9: elapsed: 0.002807s checksum: 285714.795918
parallel/10: elapsed: 0.002788s checksum: 285714.795918
Speedup: 2.845
threads: 4
parallel/1: elapsed: 0.002296s checksum: 285714.693877
parallel/2: elapsed: 0.002152s checksum: 285714.693877
parallel/3: elapsed: 0.002153s checksum: 285714.693877
parallel/4: elapsed: 0.002122s checksum: 285714.693877
parallel/5: elapsed: 0.002147s checksum: 285714.693877
parallel/6: elapsed: 0.002135s checksum: 285714.693877
parallel/7: elapsed: 0.002135s checksum: 285714.693877
parallel/8: elapsed: 0.002126s checksum: 285714.693877
parallel/9: elapsed: 0.002159s checksum: 285714.693877
parallel/10: elapsed: 0.002156s checksum: 285714.693877
Speedup: 3.738
threads: 5
parallel/1: elapsed: 0.003495s checksum: 285714.591836
parallel/2: elapsed: 0.003311s checksum: 285714.591836
parallel/3: elapsed: 0.003315s checksum: 285714.591836
parallel/4: elapsed: 0.003933s checksum: 285714.591836
parallel/5: elapsed: 0.003397s checksum: 285714.591836
parallel/6: elapsed: 0.003303s checksum: 285714.591836
parallel/7: elapsed: 0.003306s checksum: 285714.591836
parallel/8: elapsed: 0.003371s checksum: 285714.591836
parallel/9: elapsed: 0.003239s checksum: 285714.591836
parallel/10: elapsed: 0.003326s checksum: 285714.591836
Speedup: 2.449
threads: 6
parallel/1: elapsed: 0.002845s checksum: 285714.489795
parallel/2: elapsed: 0.002766s checksum: 285714.489795
parallel/3: elapsed: 0.002774s checksum: 285714.489795
parallel/4: elapsed: 0.002768s checksum: 285714.489795
parallel/5: elapsed: 0.002771s checksum: 285714.489795
parallel/6: elapsed: 0.002779s checksum: 285714.489795
parallel/7: elapsed: 0.002773s checksum: 285714.489795
parallel/8: elapsed: 0.002778s checksum: 285714.489795
parallel/9: elapsed: 0.002767s checksum: 285714.489795
parallel/10: elapsed: 0.002776s checksum: 285714.489795
Speedup: 2.868
threads: 7
parallel/1: elapsed: 0.002559s checksum: 285714.387755
parallel/2: elapsed: 0.002402s checksum: 285714.387755
parallel/3: elapsed: 0.002395s checksum: 285714.387755
parallel/4: elapsed: 0.002392s checksum: 285714.387755
parallel/5: elapsed: 0.002375s checksum: 285714.387755
parallel/6: elapsed: 0.002434s checksum: 285714.387755
parallel/7: elapsed: 0.002373s checksum: 285714.387755
parallel/8: elapsed: 0.002392s checksum: 285714.387755
parallel/9: elapsed: 0.002423s checksum: 285714.387755
parallel/10: elapsed: 0.002387s checksum: 285714.387755
Speedup: 3.342
threads: 8
parallel/1: elapsed: 0.002231s checksum: 285714.285714
parallel/2: elapsed: 0.002112s checksum: 285714.285714
parallel/3: elapsed: 0.002141s checksum: 285714.285714
parallel/4: elapsed: 0.002151s checksum: 285714.285714
parallel/5: elapsed: 0.002149s checksum: 285714.285714
parallel/6: elapsed: 0.002130s checksum: 285714.285714
parallel/7: elapsed: 0.002132s checksum: 285714.285714
parallel/8: elapsed: 0.002136s checksum: 285714.285714
parallel/9: elapsed: 0.002117s checksum: 285714.285714
parallel/10: elapsed: 0.002130s checksum: 285714.285714
Speedup: 3.756
threads: 9
parallel/1: elapsed: 0.003113s checksum: 285714.183673
parallel/2: elapsed: 0.002705s checksum: 285714.183673
parallel/3: elapsed: 0.002863s checksum: 285714.183673
parallel/4: elapsed: 0.002834s checksum: 285714.285714
parallel/5: elapsed: 0.002743s checksum: 285714.183673
parallel/6: elapsed: 0.002759s checksum: 285714.183673
parallel/7: elapsed: 0.002725s checksum: 285714.285714
parallel/8: elapsed: 0.002851s checksum: 285714.183673
parallel/9: elapsed: 0.002587s checksum: 285714.183673
parallel/10: elapsed: 0.002873s checksum: 285714.183673
Speedup: 3.067
threads: 10
parallel/1: elapsed: 0.002834s checksum: 285714.081632
parallel/2: elapsed: 0.003325s checksum: 285714.183673
parallel/3: elapsed: 0.002662s checksum: 285714.183673
parallel/4: elapsed: 0.002531s checksum: 285714.081632
parallel/5: elapsed: 0.003410s checksum: 285714.081632
parallel/6: elapsed: 0.003395s checksum: 285714.081632
parallel/7: elapsed: 0.003235s checksum: 285714.183673
parallel/8: elapsed: 0.002927s checksum: 285714.081632
parallel/9: elapsed: 0.002515s checksum: 285714.081632
parallel/10: elapsed: 0.002818s checksum: 285714.081632
Speedup: 3.154
threads: 11
parallel/1: elapsed: 0.002993s checksum: 285713.979591
parallel/2: elapsed: 0.002404s checksum: 285714.081632
parallel/3: elapsed: 0.002373s checksum: 285714.081632
parallel/4: elapsed: 0.002457s checksum: 285714.183673
parallel/5: elapsed: 0.003038s checksum: 285714.285714
parallel/6: elapsed: 0.002367s checksum: 285714.081632
parallel/7: elapsed: 0.002365s checksum: 285714.081632
parallel/8: elapsed: 0.002999s checksum: 285714.081632
parallel/9: elapsed: 0.003064s checksum: 285714.183673
parallel/10: elapsed: 0.002399s checksum: 285713.979591
Speedup: 3.354
threads: 12
parallel/1: elapsed: 0.002355s checksum: 285714.081632
parallel/2: elapsed: 0.002515s checksum: 285713.877551
parallel/3: elapsed: 0.002774s checksum: 285713.979591
parallel/4: elapsed: 0.002742s checksum: 285713.877551
parallel/5: elapsed: 0.002773s checksum: 285713.979591
parallel/6: elapsed: 0.002722s checksum: 285713.979591
parallel/7: elapsed: 0.002909s checksum: 285714.183673
parallel/8: elapsed: 0.002785s checksum: 285713.877551
parallel/9: elapsed: 0.002741s checksum: 285713.979591
parallel/10: elapsed: 0.002825s checksum: 285714.081632
Speedup: 3.369
threads: 13
parallel/1: elapsed: 0.002908s checksum: 285713.877551
parallel/2: elapsed: 0.002379s checksum: 285713.979591
parallel/3: elapsed: 0.002554s checksum: 285714.081632
parallel/4: elapsed: 0.002653s checksum: 285713.775510
parallel/5: elapsed: 0.002663s checksum: 285713.979591
parallel/6: elapsed: 0.002602s checksum: 285713.877551
parallel/7: elapsed: 0.002527s checksum: 285713.877551
parallel/8: elapsed: 0.002665s checksum: 285713.979591
parallel/9: elapsed: 0.002766s checksum: 285713.979591
parallel/10: elapsed: 0.002620s checksum: 285713.877551
Speedup: 3.335
threads: 14
parallel/1: elapsed: 0.002617s checksum: 285713.877551
parallel/2: elapsed: 0.002438s checksum: 285713.877551
parallel/3: elapsed: 0.002457s checksum: 285713.979591
parallel/4: elapsed: 0.002503s checksum: 285713.979591
parallel/5: elapsed: 0.002350s checksum: 285713.775510
parallel/6: elapsed: 0.002567s checksum: 285713.877551
parallel/7: elapsed: 0.002492s checksum: 285713.979591
parallel/8: elapsed: 0.002458s checksum: 285713.877551
parallel/9: elapsed: 0.002496s checksum: 285713.979591
parallel/10: elapsed: 0.002547s checksum: 285713.979591
Speedup: 3.376
threads: 15
parallel/1: elapsed: 0.002530s checksum: 285713.979591
parallel/2: elapsed: 0.002396s checksum: 285713.877551
parallel/3: elapsed: 0.002267s checksum: 285713.775510
parallel/4: elapsed: 0.002397s checksum: 285713.775510
parallel/5: elapsed: 0.002280s checksum: 285713.571428
parallel/6: elapsed: 0.002415s checksum: 285713.775510
parallel/7: elapsed: 0.002320s checksum: 285713.775510
parallel/8: elapsed: 0.002266s checksum: 285713.979591
parallel/9: elapsed: 0.002356s checksum: 285713.877551
parallel/10: elapsed: 0.002302s checksum: 285713.979591
Speedup: 3.501
threads: 16
parallel/1: elapsed: 0.002679s checksum: 285713.775510
parallel/2: elapsed: 0.002178s checksum: 285713.775510
parallel/3: elapsed: 0.002171s checksum: 285713.571428
parallel/4: elapsed: 0.002446s checksum: 285713.673469
parallel/5: elapsed: 0.002394s checksum: 285713.673469
parallel/6: elapsed: 0.002475s checksum: 285713.571428
parallel/7: elapsed: 0.002165s checksum: 285713.979591
parallel/8: elapsed: 0.002162s checksum: 285713.673469
parallel/9: elapsed: 0.002222s checksum: 285713.673469
parallel/10: elapsed: 0.002186s checksum: 285713.571428
Speedup: 3.670
FINAL 8 is best with 0.002112s and speedup 3.756