How to use a CUDA DevicePtr as an accelerate Array
I'm trying to use a CUDA DevicePtr (called a CUdeviceptr in CUDA-land) returned from foreign code as an accelerate Array with accelerate-llvm-ptx.

The code I've written below somewhat works:
import Data.Array.Accelerate
  (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
  (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
  (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
  (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)
-- A foreign function that uses cuMemAlloc() and cuMemcpyHtoD() to
-- create data on the GPU. The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function. It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong
-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@. This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral
-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'. It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show zs
When this program is compiled and run, it correctly prints the result:
zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]
However, from reading the accelerate and accelerate-llvm-ptx source code, it seems like this shouldn't work. In most cases, an accelerate Array carries a pointer to the array data in host memory, along with a Unique value to uniquely identify the Array. When performing an Acc computation, accelerate loads the array data from host memory into GPU memory as needed, and keeps track of it with a HashMap indexed by the Unique.

In the code above, I created an Array directly with a pointer to GPU data. It doesn't seem like this should work, yet it appears to work in the code above.
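To make that concrete, here is a deliberately simplified sketch of the bookkeeping described above. These types are my own illustration, not accelerate's actual internals (the real machinery lives in modules such as Data.Array.Accelerate.Array.Remote); they only show the shape of the idea: a host pointer plus an identifier, with device copies tracked in a map keyed by that identifier.

import Data.Word (Word64)
import qualified Data.HashMap.Strict as HM
import Foreign.Ptr (Ptr)
import Foreign.CUDA.Driver (DevicePtr)

-- Hypothetical stand-in for accelerate's Unique value.
type ArrayId = Word64

-- A host-side array: the payload lives in host memory.
data HostArray a = HostArray
  { uid      :: ArrayId  -- identifies this array across host and device
  , hostData :: Ptr a    -- pointer into HOST memory
  }

-- Device copies, keyed by the identifier. Before copying, the backend
-- consults this table; if the ArrayId is already present, the existing
-- device allocation is reused instead of transferring again.
type MemoryTable a = HM.HashMap ArrayId (DevicePtr a)

lookupDevice :: MemoryTable a -> HostArray a -> Maybe (DevicePtr a)
lookupDevice table arr = HM.lookup (uid arr) table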
However, some things don't work. For example, trying to print xs (my Array holding a pointer to GPU data) fails with a segfault. This makes sense, because the Show instance for Array just tries to peek the data through a HOST pointer. It fails because the pointer is not a host pointer, but a GPU pointer:
-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs
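For contrast, the crash-free way to inspect device-resident data is to copy it back into host memory first. Here is a minimal sketch using peekArray from the cuda package, which performs a device-to-host copy; printDeviceFloats is a hypothetical helper of mine, not part of the code above.

import Foreign.CUDA.Driver (DevicePtr)
import qualified Foreign.CUDA.Driver as CUDA
import Foreign.Marshal.Array (allocaArray, peekArray)

printDeviceFloats :: Int -> DevicePtr Float -> IO ()
printDeviceFloats n dptr =
  allocaArray n $ \hptr -> do
    CUDA.peekArray n dptr hptr   -- device-to-host copy (cuMemcpyDtoH)
    xs <- peekArray n hptr       -- read the now-valid host buffer
    putStrLn $ "xs: " <> show xs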
Is there a proper way to take a CUDA DevicePtr and use it directly as an accelerate Array?
Actually, I'm surprised that the approach above works as well as it does; I wasn't able to replicate it.

One problem here is that device memory is implicitly associated with an execution context; a pointer in one context is not valid in a different context, even on the same GPU (unless you explicitly enable peer memory access between those contexts).

So, this problem really has two components:

- importing the foreign data in a way that Accelerate understands; and
- ensuring that subsequent Accelerate computations are executed in a context which has access to this memory (see the sketch after this list).
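As a minimal sketch of the second component, assuming a ctx obtained from CUDA.create and the c_generate_gpu_data foreign import from the solution below, you can explicitly push the context that Accelerate will use around the foreign call, so the allocation lands in that context. allocateInContext is a hypothetical helper of mine.

import qualified Foreign.CUDA.Driver as CUDA

allocateInContext :: CUDA.Context -> IO (CUDA.DevicePtr Float)
allocateInContext ctx = do
  CUDA.push ctx                -- make ctx the current context on this thread
  dptr <- c_generate_gpu_data  -- the foreign cuMemAlloc now happens in ctx
  _    <- CUDA.pop             -- restore the previously-current context
  return dptr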
Solution
Here is the C code we will use to generate data on the GPU:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

CUdeviceptr generate_gpu_data()
{
    CUresult status = CUDA_SUCCESS;
    CUdeviceptr d_arr;
    const int N = 32;
    float h_arr[N];

    for (int i = 0; i < N; ++i) {
        h_arr[i] = (float)i;
    }

    status = cuMemAlloc(&d_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemAlloc failed (%d)\n", status);
        exit(1);
    }

    status = cuMemcpyHtoD(d_arr, (void*) h_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemcpyHtoD failed (%d)\n", status);
        exit(1);
    }

    return d_arr;
}
And the Haskell/Accelerate code which uses it:
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Array.Accelerate as A
import Data.Array.Accelerate.Array.Sugar as Sugar
import Data.Array.Accelerate.Array.Data as AD
import Data.Array.Accelerate.Array.Remote.LRU as LRU
import Data.Array.Accelerate.LLVM.PTX as PTX
import Data.Array.Accelerate.LLVM.PTX.Foreign as PTX
import Foreign.CUDA.Driver as CUDA
import Text.Printf
main :: IO ()
main = do
  -- Initialise CUDA and create an execution context. From this we also create
  -- the context that our Accelerate programs will run in.
  --
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev []
  ptx <- PTX.createTargetFromContext ctx

  -- When created, a context becomes the active context, so when we call the
  -- foreign function this is the context that it will be executed within.
  --
  fp  <- c_generate_gpu_data

  -- To import this data into Accelerate, we need both the host-side array
  -- (typically the only thing we see) and then associate this with the existing
  -- device memory (rather than allocating new device memory automatically).
  --
  -- Note that you are still responsible for freeing the device-side data when
  -- you no longer need it.
  --
  arr@(Array _ ad) <- Sugar.allocateArray (Z :. 32) :: IO (Vector Float)
  LRU.insertUnmanaged (ptxMemoryTable ptx) ad fp

  -- NOTE: there seems to be a bug where we haven't recorded that the host-side
  -- data is dirty, and thus needs to be filled in with values from the GPU _if_
  -- those are required on the host. At this point we have the information
  -- necessary to do the transfer ourselves, but I guess this should really be
  -- fixed...
  --
  -- CUDA.peekArray 32 fp (AD.ptrsOfArrayData ad)

  -- An alternative workaround to the above is this no-op computation (this
  -- consumes no additional host or device memory, and executes no kernels).
  -- If you never need the values on the host, you could ignore this step.
  --
  let arr' = PTX.runWith ptx (use arr)

  -- We can now use the array as in a regular Accelerate computation. The only
  -- restriction is that we need to `run*With`, so that we are running in the
  -- context of the foreign memory.
  --
  let r = PTX.runWith ptx $ A.fold (+) 0 (use arr')
  printf "array is: %s\n" (show arr')
  printf "sum is:   %s\n" (show r)

  -- Free the foreign memory (again, it is not managed by Accelerate)
  --
  CUDA.free fp

foreign import ccall unsafe "generate_gpu_data"
  c_generate_gpu_data :: IO (DevicePtr Float)
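Since the foreign allocation is not managed by Accelerate, a bracket-style wrapper makes the manual CUDA.free harder to forget. This is a sketch of my own (withForeignDeviceData is hypothetical, not part of the answer); it frees the device memory even if the computation in between throws an exception.

import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA

-- Acquire the foreign device allocation, run the action, always free.
withForeignDeviceData :: (CUDA.DevicePtr Float -> IO a) -> IO a
withForeignDeviceData = bracket c_generate_gpu_data CUDA.free

With this, the body of main (after the context is created, so that the allocation happens in the right context) would run inside withForeignDeviceData, and the explicit CUDA.free at the end disappears.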