Getting random values in MATLAB CUDA for secondary output
Question:
I compute two output arrays on the GPU, but only one of them comes back as expected:
- one output is always computed correctly
- the other contains random numbers, stale values, or values from the other array
I am using MATLAB R2016b with this CUDA version and GPU:
CUDADevice with properties:
Name: 'GeForce GT 525M'
Index: 1
ComputeCapability: '2.1'
SupportsDouble: 1
DriverVersion: 8
ToolkitVersion: 7.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [65535 65535 65535]
SIMDWidth: 32
TotalMemory: 1.0737e+09
AvailableMemory: 947929088
MultiprocessorCount: 2
ClockRateKHz: 1200000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
I am trying to add and subtract two different arrays on the GPU and return the results to MATLAB.
MATLAB code:
n = 10;
as = [1,1,1];
bs = [10,10,10];
for i = 2:n+1
as(end+1,:) = [i,i,i];
bs(end+1,:) = [10,10,10];
end
as = as *1;
% Load the kernel
cudaFilename = 'add2.cu';
ptxFilename = 'add2.ptx';
% Check that the kernel files are present
if exist(cudaFilename, 'file') ~= 2 || exist(ptxFilename, 'file') ~= 2
error('CUDA FILES ARE NOT HERE');
end
kernel = parallel.gpu.CUDAKernel( ptxFilename, cudaFilename );
% Make sure we have sufficient blocks to cover all of the locations
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1];
kernel.GridSize = [ceil(n/kernel.MaxThreadsPerBlock),1];
% Call the kernel
outadd = zeros(n,1, 'single' );
outminus = zeros(n,1, 'single' );
[outadd, outminus] = feval( kernel, outadd,outminus, as, bs );
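As a quick sanity check (a sketch, not part of the original post), the kernel results can be compared against MATLAB's own element-wise arithmetic; since MATLAB stores arrays column-major, the first n linear elements of as and bs are exactly the ones the kernel reads:
% Sanity check (sketch): compare against MATLAB's vectorized arithmetic
expected_add = single(as(1:n) + bs(1:n));
expected_minus = single(as(1:n) - bs(1:n));
isequal(gather(outadd), expected_add(:)) % should be true
isequal(gather(outminus), expected_minus(:)) % true only once the bug below is fixed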
CUDA snippet:
#include "cuda_runtime.h"
#include "add_wrapper.hpp"
#include <stdio.h>
__device__ size_t calculateGlobalIndex() {
// Which block are we?
size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x;
// Which thread are we within the block?
size_t const localThreadIdx = threadIdx.x + blockDim.x * threadIdx.y;
// How big is each block?
size_t const threadsPerBlock = blockDim.x*blockDim.y;
// Which thread are we overall?
return localThreadIdx + globalBlockIndex*threadsPerBlock;
}
__global__ void addKernel(float *c, float *d, const float *a, const float *b)
{
int i = calculateGlobalIndex();
// Note: no bounds check here; every launched thread writes to c and d.
c[i] = a[i] + b[i];
d[i] = a[i] - b[i];
}
// C = A + B
// D = A - B
void addWithCUDA(float *cpuC,float *cpuD, const float *cpuA, const float *cpuB, const size_t sz)
{
//TODO: add error checking
// choose which GPU to run on
cudaSetDevice(0);
// allocate GPU buffers
float *gpuA, *gpuB, *gpuC, *gpuD;
cudaMalloc((void**)&gpuA, sz*sizeof(float));
cudaMalloc((void**)&gpuB, sz*sizeof(float));
cudaMalloc((void**)&gpuC, sz*sizeof(float));
cudaMalloc((void**)&gpuD, sz*sizeof(float));
cudaCheckErrors("cudaMalloc fail");
// copy input vectors from host memory to GPU buffers
cudaMemcpy(gpuA, cpuA, sz*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gpuB, cpuB, sz*sizeof(float), cudaMemcpyHostToDevice);
// launch kernel on the GPU with one thread per element
addKernel<<<1,sz>>>(gpuC, gpuD, gpuA, gpuB);
// wait for the kernel to finish
cudaDeviceSynchronize();
// copy output vector from GPU buffer to host memory
cudaMemcpy(cpuC, gpuC, sz*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(cpuD, gpuD, sz*sizeof(float), cudaMemcpyDeviceToHost);
// cleanup
cudaFree(gpuA);
cudaFree(gpuB);
cudaFree(gpuC);
cudaFree(gpuD);
}
void resetDevice()
{
cudaDeviceReset();
}
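The wrapper calls cudaCheckErrors, but the snippet never defines it. A common definition of such a macro (an assumption; the original post does not show it) is:
// Assumed definition of the cudaCheckErrors macro used above
// (also #include <stdlib.h> for exit)
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)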
After running the code, [outadd, outminus] come back to MATLAB as two gpuArray objects. outadd is always computed correctly, while outminus is rarely correct: it mostly contains random integers or floats, zeros, and sometimes even values from outadd.
If I swap the order of the two arithmetic operations in the kernel, shouldn't outminus then be the one that comes out correctly?
Following @Robert Crovella's hint that the unneeded threads can cause out-of-bounds accesses (the kernel is launched with 1024 threads, but the buffers only hold n elements, so every thread with i >= n writes past the end of an allocation), I simply added a bounds check to the kernel.
MATLAB:
[outadd, outminus] = feval( kernel, outadd,outminus, as, bs, n);
CUDA kernel:
__global__ void addKernel(float *c, float *d, const float *a, const float *b, const float n)
{
int i = calculateGlobalIndex();
// Only the first n threads do any work; the rest exit immediately.
if ( i < n ){
c[i] = a[i] + b[i];
d[i] = a[i] - b[i];
}
}
I don't think this is optimal yet, because the GPU still launches all 1024 threads even though most of them do no work and shouldn't have to consume resources at all.
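The launch configuration can instead be sized to n on the MATLAB side, so that at most one partial block of threads is idle (a sketch, not from the original post; it reuses the kernel object created above):
% Sketch: size the launch to n instead of always using MaxThreadsPerBlock
blockSize = min(n, kernel.MaxThreadsPerBlock);
kernel.ThreadBlockSize = [blockSize, 1, 1];
kernel.GridSize = [ceil(n / blockSize), 1];
% The in-kernel bounds check is still needed whenever n is not a
% multiple of blockSize, since the last block then has extra threads.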