Creating arrays on the GPU with numba in python using Cuda
I want to evaluate a function at every point of a mesh. The problem is that if I create the mesh on the CPU side, the act of transferring it over to the GPU takes longer than the actual computation. Can I generate the mesh on the GPU side instead?
The code below creates the mesh on the CPU side and evaluates most of the expression on the GPU side (I wasn't sure how to get atan2 working on the GPU, so I left it on the CPU side). I should apologize in advance and say that I'm still learning this stuff, so I'm sure there is plenty of room for improvement in the code below!
Thanks!
import math
from numba import vectorize, float64
import numpy as np
from time import time
@vectorize([float64(float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2):
    return (math.sin(0.008726645 * (lat2 - lat1))**2) + \
           math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    a = a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2))
    return earthdiam_nm * np.arctan2(a, 1-a)
# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
LLA_distance_numba_cuda(X,Y,X2,Y2)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
Let's establish a performance baseline first. Adding a definition for earthdiam_nm (1.0) and running your code under nvprof, we have:
$ nvprof python t38.py
1000000 total evaluations in 0.581 seconds
(...)
==1973== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.58% 11.418ms 4 2.8544ms 2.6974ms 3.3044ms [CUDA memcpy HtoD]
28.59% 5.8727ms 1 5.8727ms 5.8727ms 5.8727ms cudapy::__main__::__vectorized_a_cuda2(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
15.83% 3.2521ms 1 3.2521ms 3.2521ms 3.2521ms [CUDA memcpy DtoH]
(...)
So on my particular setup, the "kernel" itself takes about 5.8 ms to run on my (small, slow) Quadro K2000 GPU, while the data copies take a total of 11.4 ms for the 4 host-to-device transfers plus 3.2 ms to transfer the results back to the host. The thing to focus on is the 4 copies from host to device.
Let's go after the low-hanging fruit first. This line of code:
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
doesn't really do anything except pass the values 30 and 101 to every "worker". I'm using "worker" here to refer to the idea of a particular scalar computation within numba's process of "broadcasting" the vectorize function across the large dataset. The numba vectorize/broadcast process does not require every input to be a dataset of the same size; it only requires that the data supplied be "broadcast"-able. So it is possible, for example, to create a vectorize ufunc that operates on an array and a scalar. In that case each worker uses its own array element plus the (same) scalar to perform its computation.
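As an aside, here is a minimal sketch of what that array-plus-scalar broadcasting looks like (the scale_by ufunc is purely illustrative and not part of your code):

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)], target='cuda')
def scale_by(x, s):
    # x is drawn element-wise from the array argument; s is the broadcast scalar
    return x * s

data = np.linspace(0.0, 1.0, 10)
print(scale_by(data, 2.0))   # the scalar 2.0 is handed to every worker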
So the low-hanging fruit here is simply to drop those two arrays and pass the values (30, 101) to the a_cuda ufunc as scalars. While we're going after the "low hanging fruit", let's also fold your arctan2 computation (replaced with math.atan2) and the final scaling by earthdiam_nm into the vectorized code, so that we don't have to do them on the host in python/numpy:
$ cat t39.py
import math
from numba import vectorize, float64
import numpy as np
from time import time
earthdiam_nm = 1.0
@vectorize([float64(float64,float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
        math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s
def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    return a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
                  np.ascontiguousarray(lat2), np.ascontiguousarray(lon2), earthdiam_nm)
# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
# X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
Z=LLA_distance_numba_cuda(X,Y,30.0,101.0)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Z)
$ nvprof python t39.py
==2387== NVPROF is profiling process 2387, command: python t39.py
1000000 total evaluations in 0.401 seconds
==2387== Profiling application: python t39.py
==2387== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 48.12% 8.4679ms 1 8.4679ms 8.4679ms 8.4679ms cudapy::__main__::__vectorized_a_cuda2(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
33.97% 5.9774ms 5 1.1955ms 864ns 3.2535ms [CUDA memcpy HtoD]
17.91% 3.1511ms 4 787.77us 1.1840us 3.1459ms [CUDA memcpy DtoH]
(snip)
Now we see that the HtoD copy operations have dropped from a total of 11.4 ms to a total of about 6 ms. The kernel has grown from ~5.8 ms to ~8.5 ms because we are doing more work in it, but the function execution time reported by python has dropped from ~0.58 s to ~0.4 s.
Can we do better?
We can, but in order to do so (I believe) we need to use a different numba cuda approach. The vectorize method is convenient for scalar element-wise operations, but it has no way of knowing where in the overall dataset a given operation is being performed. We need that information, and we can get it in CUDA code, but we need to switch to the @cuda.jit decorator to do so.
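For reference, a minimal sketch of what cuda.grid(2) gives you inside a kernel (the where_am_i kernel and the flat index it stores are purely illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def where_am_i(out):
    # cuda.grid(2) is shorthand for:
    #   x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    #   y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    x, y = cuda.grid(2)
    if x < out.shape[1] and y < out.shape[0]:
        out[y, x] = y * out.shape[1] + x   # this thread's position in the dataset

out = cuda.device_array((4, 8), dtype=np.float64)
where_am_i[(1, 1), (8, 4)](out)   # one block of 8x4 threads covers the 4x8 array
print(out.copy_to_host())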
The following code converts the previous vectorize a_cuda function into a @cuda.jit device function (with essentially no other changes), and then creates a CUDA kernel that does the mesh generation on the fly from the supplied scalar parameters and computes the result:
$ cat t40.py
import math
from numba import vectorize, float64, cuda
import numpy as np
from time import time
earthdiam_nm = 1.0
@cuda.jit(device=True)
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
        math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s
@cuda.jit
def LLA_distance_numba_cuda(lat2, lon2, xb, xe, yb, ye, s, nx, ny, out):
    x, y = cuda.grid(2)
    if x < nx and y < ny:
        lat1 = (((xe-xb) * x)/(nx-1)) + xb # mesh generation
        lon1 = (((ye-yb) * y)/(ny-1)) + yb # mesh generation
        out[y][x] = a_cuda(lat1, lon1, lat2, lon2, s)
nx, ny = 1000,1000
Z = cuda.device_array((nx,ny), dtype=np.float64)
threads = (32,32)
blocks = (32,32)
start = time()
LLA_distance_numba_cuda[blocks,threads](30.0,101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)
Zh = Z.copy_to_host()
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Zh)
$ nvprof python t40.py
==2855== NVPROF is profiling process 2855, command: python t40.py
1000000 total evaluations in 0.294 seconds
==2855== Profiling application: python t40.py
==2855== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 75.60% 10.364ms 1 10.364ms 10.364ms 10.364ms cudapy::__main__::LLA_distance_numba_cuda1(double, double, double, double, double, double, double, __int64, __int64, Array<double, int=2, A, mutable, aligned>)
24.40% 3.3446ms 1 3.3446ms 3.3446ms 3.3446ms [CUDA memcpy DtoH]
(...)
Now we see that:
- the kernel duration is even longer, about 10 ms (because we are doing the mesh generation in addition to the computation)
- there is no longer any explicit host-to-device data copying
- the overall function runtime has dropped from ~0.4 s to ~0.3 s
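One small follow-up on the launch configuration: blocks = (32,32) is hard-coded above and happens to cover the 1000x1000 mesh with 32x32-thread blocks, but the block count can also be derived from the problem size so the same launch works for other mesh dimensions. A sketch, assuming the same threads-per-block choice and the names from the t40.py listing:

threads = (32, 32)
blocks = ((nx + threads[0] - 1) // threads[0],   # ceiling division: enough blocks in x
          (ny + threads[1] - 1) // threads[1])   # ceiling division: enough blocks in y
LLA_distance_numba_cuda[blocks, threads](30.0, 101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)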