Why is my parallelized for loop giving a different output?
I declare my threads like this:
    for (thread_num = 0; thread_num < NUM_THREADS; thread_num++) // for each thread
        pthread_create(&thread_handles[thread_num], NULL, gemver_default, (void*)thread_num); // create and run the thread; it runs gemver_default() with thread_num as its argument
    for (thread_num = 0; thread_num < NUM_THREADS; thread_num++) // for each thread
        pthread_join(thread_handles[thread_num], NULL); // wait for the thread to finish
Then my pthread loop:
    unsigned short int gemver_default(void * thread_num) {
        long int my_thread_num = (long int)thread_num; // store the input of the function in my_thread_num
        int local = P / NUM_THREADS; // the number of array elements that each thread must compute
        int starting_element = my_thread_num * local; // first array element to be computed by this thread
        int ending_element = starting_element + local - 1; // last array element to be computed by this thread
        for (i = starting_element; i < ending_element; i++)
            for (j = 0; j < local; j++)
                A2[i][j] += u1[i] * v1[j] + u2[i] * v2[j];
    }
And my original loop:
    unsigned short int gemver_default() {
        // this is the loop to parallelize
        for (int i = 0; i < P; i++)
            for (int j = 0; j < P; j++)
                A2[i][j] += u1[i] * v1[j] + u2[i] * v2[j];
        return 0;
    }
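I don't understand why the output is different.
I have created the threads, pointed them at the function I want them to run, and adapted my old loop to it.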
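So far I have found two small problems in your code:
1. You set ending_element = starting_element + local - 1, but the loop condition is i < ending_element, so each thread skips the last row of its chunk. Either change the assignment to ending_element = starting_element + local, or change the loop condition to i <= ending_element.
2. Using P / NUM_THREADS works if P is evenly divisible by NUM_THREADS, but if it is not, your threads will not cover all indices from 0 to P - 1. For example, if P = 14 and NUM_THREADS = 5, then P / NUM_THREADS = 2, and your threads will only process indices 0 to 9, ignoring indices 10 to 13.
A fix for this problem: set local = P / NUM_THREADS + 1 and change the loop condition from i < ending_element to (i < ending_element) && (i < P).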
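To make the two fixes concrete, here is a minimal sketch of how the threaded version could look with both applied. Several things in it are my assumptions and not taken from the question: the element type double, the values P = 14 and NUM_THREADS = 5 (borrowed from the example above), and the helper names gemver_worker, gemver_serial, gemver_parallel, init_data and results_match. It also differs from the posted thread function in three ways that seem necessary for the outputs to match: the inner loop runs to j < P as in the serial loop (the posted version stops at j < local), the loop counters are local to each thread, and the worker has the void *(*)(void *) signature that pthread_create expects. I used ceiling division plus a clamp for the chunk size, which is one alternative to the "+ 1" approach.

```c
#include <pthread.h>
#include <stdint.h>

#define P 14           /* matrix dimension (assumed, from the example above) */
#define NUM_THREADS 5  /* thread count (assumed, from the example above) */

/* Element type double is an assumption; the question does not show it. */
static double A2[P][P], A2_ref[P][P];
static double u1[P], v1[P], u2[P], v2[P];

/* Thread body: each thread updates a contiguous block of rows. */
static void *gemver_worker(void *arg)
{
    long my_thread_num = (long)(intptr_t)arg;
    /* Ceiling division so the chunks cover every row even when
       P is not a multiple of NUM_THREADS. */
    int local = (P + NUM_THREADS - 1) / NUM_THREADS;
    int start = (int)my_thread_num * local;
    int end = start + local;   /* exclusive upper bound (fix 1) */
    if (end > P)
        end = P;               /* clamp the last chunk (fix 2) */

    /* i and j are local, so threads do not share loop counters. */
    for (int i = start; i < end; i++)
        for (int j = 0; j < P; j++)   /* full row, as in the serial loop */
            A2[i][j] += u1[i] * v1[j] + u2[i] * v2[j];
    return NULL;
}

/* Serial reference, same arithmetic as the original loop. */
static void gemver_serial(void)
{
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            A2_ref[i][j] += u1[i] * v1[j] + u2[i] * v2[j];
}

/* Create all threads, then wait for them; returns 0 on success. */
static int gemver_parallel(void)
{
    pthread_t thread_handles[NUM_THREADS];
    int created = 0;
    for (long t = 0; t < NUM_THREADS; t++) {
        if (pthread_create(&thread_handles[t], NULL, gemver_worker,
                           (void *)(intptr_t)t) != 0)
            break;
        created++;
    }
    for (int t = 0; t < created; t++)
        pthread_join(thread_handles[t], NULL);
    return created == NUM_THREADS ? 0 : -1;
}

/* Fill both matrices and the vectors with identical test data. */
static void init_data(void)
{
    for (int i = 0; i < P; i++) {
        u1[i] = i + 1;
        v1[i] = 0.5 * i;
        u2[i] = 2.0 * i;
        v2[i] = i - 3;
        for (int j = 0; j < P; j++)
            A2[i][j] = A2_ref[i][j] = i * P + j;
    }
}

/* Returns 1 if the threaded result equals the serial one, else 0. */
static int results_match(void)
{
    init_data();
    gemver_serial();
    if (gemver_parallel() != 0)
        return 0;
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            if (A2[i][j] != A2_ref[i][j])
                return 0;
    return 1;
}
```

Calling results_match() from main (compiled with -pthread) is one way to check the threaded loop against the serial one. Because each thread writes a disjoint set of rows of A2, no locking is needed for the update itself.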