How do I use the GPU available with OpenMP?
I am trying to get some code to run on the GPU using OpenMP, but I am not succeeding. In my code, I perform a matrix multiplication using for loops: once marked with an OpenMP pragma, and once without. (This is so I can compare the execution times.) After the first loop I call omp_get_num_devices() (this is my main test to see whether I am actually connecting to a GPU). No matter what I try, omp_get_num_devices() always returns 0.
The computer I am using has two NVIDIA Tesla K40M GPUs. CUDA 7.0 and CUDA 7.5 are available on the machine as modules, with the CUDA 7.5 module typically active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module typically active. I am compiling my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code using the CPU, but not using the GPU.
My main goal here is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help would be greatly appreciated.
Here is the code I am using to check whether the GPU is being used correctly:
#include <omp.h>
#include <fstream>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <iomanip>
#include <cstdio>
#include <iostream>

using namespace std;

double A [501][501];
double B [501][501];
double C [501][501][501];
double D [501][501];
double E [501][501];
double F [501][501][501];
double dummyvar;
int Mapped [501];

int main() {
    int i, j, k, l, N, StallerGPU, StallerCPU;

    N = 500;
    // Variables merely used to make the execution take longer and to
    // exaggerate the difference in performance between the first and
    // second calculation
    StallerGPU = 200;
    StallerCPU = 200;

    std::cout << " N = " << N << "\n";

    // generate matrix to be used in the first calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                A[i][k] = i+1;
            } else {
                A[i][k] = i * k / N;
            }
        }
    }
    // generate the other matrix to be used in the first calculation
    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            B[k][j] = 2*(N-1)-k-j;
        }
    }

    // slightly adjusted matrices for the second calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                D[i][k] = i+2;
            } else {
                D[i][k] = i * k / N - 1;
            }
        }
    }
    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            E[k][j] = 2*(N+1)-k-j;
        }
    }

    dummyvar = 0;

    // Run the multiplication in parallel using GPUs
    double diff;
    time_t time1;
    time1 = time( NULL ); // CPU time counter
    cout << endl << " GPU section begins at " << ctime(&time1) << endl;

    // This pragma is frequently changed to try different tags
    #pragma omp for collapse(4) private(i, j, k, l)
    for (i=0; i<N; i++) {
        // Mapped[i] = omp_is_initial_device();
        for (j=0; j<N; j++) {
            for (k=0; k<N; k++) {
                for (l=0; l<StallerGPU; l++) {
                    C[i][j][k] = A[i][k] * B[k][j];
                    dummyvar += A[i][k] * B[k][j] * (l + 1);
                }
            }
            // cout << " i " << i << endl;
        }
    }

    // record the time it took to run the multiplication
    time_t time2 = time( NULL );
    cout << " number of devices: " << omp_get_num_devices() << endl;
    cout << " dummy variable: " << dummyvar << endl;

    float cpumin = difftime(time2,time1);
    diff = difftime(time2,time1);
    cout << " stopping at delta GPU time: " << cpumin << endl;
    cout << " terminating at " << ctime(&time2) << endl;
    cout << " GPU time elapsed " << diff << " s" << endl;
    cout << endl;

    dummyvar = 0;
    time_t time3 = time( NULL );
    cout << endl << " CPU section begins at " << ctime(&time3) << endl;
    // #pragma omp single
    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            for (k=0; k<N; k++) {
                for (int l=0; l<StallerCPU; l++) {
                    F[i][j][k] = D[i][k] * E[k][j];
                    dummyvar += D[i][k] * E[k][j] * (l - 1);
                }
            }
        }
    }
    // the sum to complete the matrix calculation is left out here, but would
    // only be used to check if the result of the calculation is correct
    time_t time4 = time( NULL );
    cpumin = difftime(time4,time3);
    diff = difftime(time4,time3);
    cout << " dummy variable: " << dummyvar << endl;
    cout << " stopping at delta CPU time: " << cpumin << endl;
    cout << " terminating at " << ctime(&time4) << endl;
    cout << " CPU time elapsed " << diff << " s" << endl;
    // Compare the times to confirm that we actually used GPUs to parallelize.
}
Here is the result of running the deviceQuery sample CUDA code.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 130 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla K40m"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
Result = PASS
Maybe I am heading in the wrong direction, but I would like to help, because I once ran into a strange situation when using GPUs.
On Linux, you need to be in the "video" group in order to use the GPUs; otherwise everything the GPUs return is 0.
So I suggest you run a sample CUDA code to check whether you are stuck in the same situation I was. It is strange, and I am not sure I have described it correctly. I hope this helps.
According to this: https://wiki.gentoo.org/wiki/NVidia/nvidia-drivers
The user(s) needing to access the video card will need to be added to
the video group
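If you want a quick standalone check (a minimal sketch of my own, not code from the post; the file name is made up), the CUDA runtime call cudaGetDeviceCount() gives a yes/no answer in a few lines. Compile it with nvcc; if it fails or reports 0 devices for your user while deviceQuery passes for another, a permissions problem such as missing "video" group membership is a likely cause:
// devcheck.cu -- hypothetical file name; compile with: nvcc devcheck.cu -o devcheck
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // errors such as cudaErrorNoDevice or cudaErrorInsufficientDriver
        // often point at driver or permission problems
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA sees %d device(s)\n", count);
    return 0;
}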
I may be wrong, but I think the posted code needs a few corrections (perhaps you already know this). To actually run on a GPU target with OpenMP, you need to replace:
#pragma omp for collapse(4) private(i, j, k, l)
with
#pragma omp target teams distribute parallel for collapse(4) private(i, j, k, l)
You can verify that the kernel is really running on the GPU by profiling your executable with 'nvprof'; it should show the kernels being executed on the GPU. You can also change the number of teams and threads in the target region using the 'num_teams' and 'thread_limit' clauses, and you should see corresponding changes in the profile.
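For illustration, here is a sketch of what the offloaded loop nest could look like with explicit data mapping and the team/thread clauses mentioned above (the num_teams and thread_limit values are arbitrary examples, and this is my sketch rather than code from the question). Note that collapsing all four loops would make the repeated updates of dummyvar a data race, so this version collapses only the three outer loops and leaves out that accumulation:
// sketch only: assumes a compiler built with NVPTX offload support
#pragma omp target teams distribute parallel for collapse(3) \
        map(to: A, B) map(from: C) num_teams(30) thread_limit(256)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j][k] = A[i][k] * B[k][j];
        }
    }
}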
To programmatically check whether the target region is actually running on the target device, I use a call to 'omp_is_initial_device()', which returns 0 when called from the accelerator. Here is an example:
#include <omp.h>
#include <stdio.h>

int main() {
    int A[1] = {-1};
    #pragma omp target map(tofrom: A)
    {
        A[0] = omp_is_initial_device();  // 0 when running on the accelerator
    }
    if (!A[0]) {
        printf("Able to use offloading!\n");
    }
    return 0;
}
GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to GPUs.
GCC 7.1.0 does support it, but it has to be built with special configure options, as described here.
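As a quick way to tell whether a particular GCC build has offloading compiled in (a sketch based on how offload-enabled GCC builds typically report themselves, not on the original post): 'gcc -v' prints an OFFLOAD_TARGET_NAMES line when the compiler was configured with offload targets, and GCC selects the offload target with the -foffload flag (the -omptargets=... flag used in the question is not a GCC option):
$ g++ -v 2>&1 | grep -i offload
OFFLOAD_TARGET_NAMES=nvptx-none
$ g++ -fopenmp -foffload=nvptx-none ParallelExperimenting.cpp -o ParallelExperimenting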