使用 GPU 计算 Python/Numba 中前 N 个最近城市的更好方法
Better way to compute Top N closest cities in Python/Numba using GPU
我有 M~200k 个点,城市的 X、Y 坐标打包在一个 Mx2 numpy 数组中。
目的是,对于每个城市,计算前 N 个最近的城市和 return 它们的索引和到 MxN numpy 矩阵中该城市的距离。
Numba 很好地加速了我在 CPU 上的串行 python 代码,并使用 prange 生成器使其成为多线程,这样在 16 核 SSE 4.2/AVX 机器上调用计算 30 个最近的城市在 6 分 26 秒内完成,同时使所有核心饱和:
@numba.njit(fastmath=True)#
def get_closest(n,N_CLOSEST,coords,i):
dist=np.empty(n,np.float32)
for j in range(n):
dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
indices=np.argsort(dist)[1:N_CLOSEST+1]
return indices,dist[indices]
@numba.njit(fastmath=True,parallel=True)
def get_top_T_closest_cities(coords,T):
n=len(coords)
N_CLOSEST=min(T,n-1)
closest_distances=np.empty((n,N_CLOSEST),np.float32)
closest_cities=np.empty((n,N_CLOSEST),np.int32)
for i in prange(n):
closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,coords,i)
return closest_cities,closest_distances
closest_cities,closest_distances=get_top_T_closest_cities(data,30)
注:我想用
mrange=np.arange(N_CLOSEST)
和
指数=np.argpartition(距离,范围)
后来为了节省一些周期,但不幸的是 Numba 还不支持 np.argpartition。
然后我决定好好利用我新买的 RTX 2070,并尝试将这些非常并行的自然计算卸载到 GPU,再次使用 Numba 和可能的 CuPy。
经过一番思考,我想出了一个相对愚蠢的重写,其中一次处理一个城市的 GPU 内核为 M 个城市中的每一个城市连续调用。在那个内核中,每个 GPU 线程都在计算它的距离并保存到 dist 数组中的特定位置。所有数组都分配在设备上以最小化 PCI 数据传输:
import cupy as cp
def get_top_T_closest_cities_gpu(coords,T):
n=len(coords)
N_CLOSEST=min(T,n-1)
closest_distances=cp.empty((n,N_CLOSEST),cp.float32)
closest_cities=cp.empty((n,N_CLOSEST),cp.int32)
device_coords=cp.asarray(coords)
dist=cp.ndarray(n,cp.float32)
for i in range(n):
if (i % 1000)==0:
print(i)
closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,device_coords,i,dist)
return cp.asnumpy(closest_cities),cp.asnumpy(closest_distances)
@cuda.jit()
def get_distances(coords,i,n,dist):
stride = cuda.gridsize(1)
start = cuda.grid(1)
for j in range(start, n, stride):
dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
def get_closest(n,N_CLOSEST,coords,i,dist):
get_distances[512,512](coords,i,n,dist)
indices=cp.argsort(dist)[1:N_CLOSEST+1]
return indices,dist[indices]
现在,GPU 上的计算结果花费了几乎相同的 6 分钟时间,但 GPU 负载仅为 1%(是的,我确保 return 由 CPU 编辑的结果和 GPU 版本相同).我玩了一下块大小,没有看到任何明显的变化。有趣的是,cp.argsort 和 get_distances 消耗的处理时间份额大致相同。
我觉得这与流有关,如何正确地初始化很多流?这将是重用我的代码的最直接方法,一次处理的不是 1 个城市,而是 16 个或我的计算能力允许的任何城市,理想情况下可能是 1000 个。
你们这些在 Numba/CuPy 中有 GPU 编码经验的人会建议在我的案例中充分利用 GPU 的能力吗?
来自纯 C++ CUDA 辩护者的建议也非常受欢迎,因为我还没有看到原生 Cuda 解决方案与 Numba/CuPy CUDA 解决方案的比较。
Python 版本:['3.6.3 |Anaconda, Inc']
平台:AMD64
系统:Windows-10-10.0.14393-SP0
Numba 版本:0.41.0
您似乎对基于 CUDA C++ 的答案感兴趣,所以我提供一个。将这个内核转换为等效的 numba cuda jit 内核应该很简单;翻译过程相当机械。
我选择的方法大纲如下:
- 为每个线程分配一个城市(该线程的参考城市)
- 每个线程遍历所有城市计算到其参考城市的成对距离
- 在计算每个距离时,会检查它是否在当前 "closest cities" 范围内。这里的方法是为每个线程保留一个 运行ning 列表。当每个线程计算一个新距离时,它会检查该距离是否小于 运行ning 列表中的当前最大距离。如果是,则当前最大值 distance/city 将替换为刚刚计算的最大值。
- "closest cities" 的列表及其距离保存在共享内存中。
- 所有距离计算完成后,每个线程然后排序并将其共享缓冲区中的值写入全局内存
- 这个基本计划有几个 "optimizations"。一种是每个warp一次读取32个连续的值,然后使用shuffle操作在线程之间传递这些值,以减少全局内存流量。
单个内核调用执行所有城市的整个操作。
此内核的性能会因您使用的 GPU 而有很大差异 运行。例如,在特斯拉 P100 上,它 运行s 大约需要 0.5s。在 Tesla K40 上大约需要 6 秒。在 Quadro K2000 上大约需要 40 秒。
这是 Tesla P100 的完整测试用例:
$ cat t364.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <thrust/sort.h>
const int ncities = 200064; // should be divisible by nTPB
const int nclosest = 32; // maximum 48
const int nTPB = 128; // should be multiple of 32
const float FLOAT_MAX = 999999.0f;
__device__ void init_shared(float *d, int n){
int idx = threadIdx.x;
for (int i = 0; i < n; i++){
d[idx] = FLOAT_MAX;
idx += nTPB;}
}
__device__ int find_max(float *d, int n){
int idx = threadIdx.x;
int max = 0;
float maxd = d[idx];
for (int i = 1; i < n; i++){
idx += nTPB;
float next = d[idx];
if (maxd < next){
max = i;
maxd = next;}}
return max;
}
__device__ float compute_sqdist(float2 c1, float2 c2){
return (c1.x - c2.x)*(c1.x - c2.x) + (c1.y - c2.y)*(c1.y - c2.y);
}
__device__ void sort_and_store_sqrt(int idx, float *closest_dist, int *closest_id, float *dist, int *city, int n, int n_cities)
{
for (int i = n-1; i >= 0; i--){
int max = find_max(dist, n);
closest_dist[idx + i*n_cities] = sqrtf(dist[max*nTPB+threadIdx.x]);
closest_id[idx + i*n_cities] = city[max*nTPB+threadIdx.x];
dist[max*nTPB+threadIdx.x] = 0.0f;}
};
__device__ void shuffle(int &city_id, float2 &next_city_xy, int my_warp_lane){
int src_lane = (my_warp_lane+1)&31;
city_id = __shfl_sync(0xFFFFFFFF, city_id, src_lane);
next_city_xy.x = __shfl_sync(0xFFFFFFFF, next_city_xy.x, src_lane);
next_city_xy.y = __shfl_sync(0xFFFFFFFF, next_city_xy.y, src_lane);
}
__global__ void k(const float2 * __restrict__ cities, float * __restrict__ closest_dist, int * __restrict__ closest_id, const int n_cities){
__shared__ float dist[nclosest*nTPB];
__shared__ int city[nclosest*nTPB];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int my_warp_lane = (threadIdx.x & 31);
init_shared(dist, nclosest);
float2 my_city_xy = cities[idx];
float last_max = FLOAT_MAX;
for (int i = 0; i < n_cities; i+=32){
int city_id = i+my_warp_lane;
float2 next_city_xy = cities[city_id];
for (int j = 0; j < 32; j++){
if (j > 0) shuffle(city_id, next_city_xy, my_warp_lane);
if (idx != city_id){
float my_dist = compute_sqdist(my_city_xy, next_city_xy);
if (my_dist < last_max){
int max = find_max(dist, nclosest);
last_max = dist[max*nTPB+threadIdx.x];
if (my_dist < last_max) {
dist[max*nTPB+threadIdx.x] = my_dist;
city[max*nTPB+threadIdx.x] = city_id;}}}}}
sort_and_store_sqrt(idx, closest_dist, closest_id, dist, city, nclosest, n_cities);
}
void on_cpu(float2 *cities, float *dists, int *ids){
thrust::host_vector<int> my_ids(ncities);
thrust::host_vector<float> my_dists(ncities);
for (int i = 0; i < ncities; i++){
for (int j = 0; j < ncities; j++){
my_ids[j] = j;
if (i != j) my_dists[j] = sqrtf((cities[i].x - cities[j].x)*(cities[i].x - cities[j].x) + (cities[i].y - cities[j].y)*(cities[i].y - cities[j].y));
else my_dists[j] = FLOAT_MAX;}
thrust::sort_by_key(my_dists.begin(), my_dists.end(), my_ids.begin());
for (int j = 0; j < nclosest; j++){
dists[i+j*ncities] = my_dists[j];
ids[i+j*ncities] = my_ids[j];}
}
}
int main(){
float2 *h_cities, *d_cities;
float *h_closest_dist, *d_closest_dist, *h_closest_dist_cpu;
int *h_closest_id, *d_closest_id, *h_closest_id_cpu;
cudaMalloc(&d_cities, ncities*sizeof(float2));
cudaMalloc(&d_closest_dist, nclosest*ncities*sizeof(float));
cudaMalloc(&d_closest_id, nclosest*ncities*sizeof(int));
h_cities = new float2[ncities];
h_closest_dist = new float[ncities*nclosest];
h_closest_id = new int[ncities*nclosest];
h_closest_dist_cpu = new float[ncities*nclosest];
h_closest_id_cpu = new int[ncities*nclosest];
for (int i = 0; i < ncities; i++){
h_cities[i].x = (rand()/(float)RAND_MAX)*10.0f;
h_cities[i].y = (rand()/(float)RAND_MAX)*10.0f;}
cudaMemcpy(d_cities, h_cities, ncities*sizeof(float2), cudaMemcpyHostToDevice);
k<<<ncities/nTPB, nTPB>>>(d_cities, d_closest_dist, d_closest_id, ncities);
cudaMemcpy(h_closest_dist, d_closest_dist, ncities*nclosest*sizeof(float),cudaMemcpyDeviceToHost);
cudaMemcpy(h_closest_id, d_closest_id, ncities*nclosest*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < nclosest; i++){
for (int j = 0; j < 5; j++) std::cout << h_closest_id[j+i*ncities] << "," << h_closest_dist[j+i*ncities] << " ";
std::cout << std::endl;}
if (ncities < 5000){
on_cpu(h_cities, h_closest_dist_cpu, h_closest_id_cpu);
for (int i = 0; i < ncities*nclosest; i++)
if (h_closest_id_cpu[i] != h_closest_id[i]) {std::cout << "mismatch at: " << i << " was: " << h_closest_id[i] << " should be: " << h_closest_id_cpu[i] << std::endl; return -1;}
}
return 0;
}
$ nvcc -o t364 t364.cu
$ nvprof ./t364
==31871== NVPROF is profiling process 31871, command: ./t364
33581,0.00466936 116487,0.0163371 121419,0.0119542 138585,0.00741395 187892,0.0205568
56138,0.0106946 105637,0.0195111 175565,0.0132018 121719,0.0090809 198333,0.0269794
36458,0.014851 6942,0.0244329 67920,0.013367 140919,0.014739 91533,0.0348142
48152,0.0221216 146867,0.0250192 14656,0.020469 163149,0.0247639 162442,0.0354359
3681,0.0226946 127671,0.0265515 32841,0.0219539 109520,0.0359874 21346,0.0424094
20903,0.0313963 3075,0.0279635 79787,0.0220388 106161,0.0365807 24186,0.0453916
191226,0.0316818 4168,0.0297535 126726,0.0246147 154598,0.0374218 62106,0.0459001
185573,0.0349628 76270,0.030849 122878,0.0249695 124897,0.0447656 38124,0.0463704
71252,0.037517 18879,0.0350544 112410,0.0299296 102006,0.0462593 12361,0.0464608
172641,0.0399721 134288,0.035077 39252,0.031506 164570,0.0470057 81105,0.0502873
18960,0.0433545 53266,0.0360357 195467,0.0334281 36715,0.0470069 153375,0.0508115
163568,0.0442928 95544,0.0361058 151839,0.0398384 114041,0.0490263 8524,0.0511531
182326,0.047519 59919,0.0362906 90810,0.0408069 52013,0.0494515 16793,0.0525569
169860,0.048417 77412,0.036694 12065,0.0414496 137863,0.0494703 197500,0.0537481
40158,0.0492621 34592,0.0377113 54812,0.041594 58792,0.049532 70501,0.0541114
121444,0.0501154 102800,0.0414865 96238,0.0433548 34323,0.0531493 161527,0.0551868
118564,0.0509228 82832,0.0449587 167965,0.0439488 104475,0.0534779 66968,0.0572788
60136,0.0528873 88318,0.0455667 118562,0.0462537 129853,0.0535594 7544,0.0588875
95975,0.0587857 65792,0.0479467 124929,0.046828 116360,0.0537344 37341,0.0594454
62542,0.0592229 57399,0.0509665 186583,0.0470843 47571,0.0541045 81275,0.0596965
11499,0.0607943 28314,0.0512354 23730,0.0518801 176089,0.0543222 562,0.06527
131594,0.0611795 23120,0.0515408 25933,0.0547776 117474,0.0557752 194499,0.0657885
23959,0.0623019 137283,0.0533193 173000,0.0577509 157537,0.0566689 193091,0.0666895
5660,0.0629772 15498,0.0555821 161025,0.0596721 123148,0.0589929 147928,0.0672529
51033,0.063036 15456,0.0565314 145859,0.0601408 3012,0.0601779 107646,0.0687241
26790,0.0643055 99659,0.0576798 33153,0.0603556 48388,0.0617377 47566,0.0689055
178826,0.0655352 143209,0.058413 101960,0.0604994 22146,0.0620504 91118,0.0705487
32108,0.0672866 172089,0.058676 17885,0.0638383 137318,0.0624543 138223,0.0716578
38125,0.0678566 42847,0.0609811 10879,0.0681518 154360,0.0633921 96195,0.0723226
96583,0.0683073 169703,0.0621889 100839,0.0721133 28595,0.0661302 20422,0.0731882
98329,0.0683718 50432,0.0636618 84204,0.0733909 181919,0.066552 75375,0.0736715
75814,0.0694582 161714,0.0674298 89049,0.0749184 151275,0.0679411 37849,0.0739173
==31871== Profiling application: ./t364
==31871== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 95.66% 459.88ms 1 459.88ms 459.88ms 459.88ms k(float2 const *, float*, int*, int)
4.30% 20.681ms 2 10.340ms 10.255ms 10.425ms [CUDA memcpy DtoH]
0.04% 195.55us 1 195.55us 195.55us 195.55us [CUDA memcpy HtoD]
API calls: 58.18% 482.42ms 3 160.81ms 472.46us 470.80ms cudaMemcpy
41.05% 340.38ms 3 113.46ms 322.83us 339.73ms cudaMalloc
0.55% 4.5490ms 384 11.846us 339ns 503.79us cuDeviceGetAttribute
0.16% 1.3131ms 4 328.27us 208.85us 513.66us cuDeviceTotalMem
0.05% 407.13us 4 101.78us 93.796us 118.27us cuDeviceGetName
0.01% 61.719us 1 61.719us 61.719us 61.719us cudaLaunchKernel
0.00% 23.965us 4 5.9910us 3.9830us 8.9510us cuDeviceGetPCIBusId
0.00% 6.8440us 8 855ns 390ns 1.8150us cuDeviceGet
0.00% 3.7490us 3 1.2490us 339ns 2.1300us cuDeviceGetCount
0.00% 2.0260us 4 506ns 397ns 735ns cuDeviceGetUuid
$
CUDA 10.0、Tesla P100、CentOS 7
该代码包括验证的可能性,但它根据基于 CPU 的原始代码验证结果,这需要更长的时间。所以我将验证限制在较小的测试用例中,例如。 4096个城市。
这是使用 numba cuda 对上述代码的 "translated" 近似。它似乎 运行 比 CUDA C++ 版本慢大约 2 倍,但事实并非如此。 (促成因素之一似乎是 sqrt
函数的使用——它对 numba 代码有显着的性能影响,但对 CUDA C++ 代码没有性能影响。它可能使用双精度实现在 numba CUDA 的引擎盖下。)无论如何,它提供了如何在 numba 中实现它的参考:
import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32
# number of cities, should be divisible by threadblock size
N = 200064
# number of "closest" cities
TN = 32
#number of threads per block
threadsperblock = 128
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
idx = cuda.threadIdx.x
mymax = 0
maxd = dist[idx]
for i in range(TN-1):
idx += threadsperblock
mynext = dist[idx]
if maxd < mynext:
mymax = i+1
maxd = mynext
return mymax
@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
dist = cuda.shared.array(smSize, float32)
city = cuda.shared.array(smSize, int32)
idx = cuda.grid(1)
tid = cuda.threadIdx.x
wl = tid & 31
my_city_x = citiesx[idx];
my_city_y = citiesy[idx];
last_max = 99999.0
for i in range(TN):
dist[tid+i*threadsperblock] = 99999.0
for i in xrange(0,nc,32):
city_id = i+wl
next_city_x = citiesx[city_id]
next_city_y = citiesy[city_id]
for j in range(32):
if j > 0:
src_lane = (wl+1)&31
city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
if idx != city_id:
my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
if my_dist < last_max:
mymax = find_max(dist)
last_max = dist[mymax*threadsperblock+tid]
if my_dist < last_max:
dist[mymax*threadsperblock+tid] = my_dist
city[mymax*threadsperblock+tid] = city_id
for i in range(TN):
mymax = find_max(dist)
td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
ti[idx+i*nc] = city[mymax*threadsperblock+tid]
dist[mymax*threadsperblock+tid] = 0
#peform test
cx = np.random.rand(N).astype(np.float32)
cy = np.random.rand(N).astype(np.float32)
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
for i in range(TN):
for j in range(1):
print(ti[i*N+j])
print(td[i*N+j])
我没有在这个 numba cuda 变体中包含验证,因此它可能包含缺陷。
编辑:通过将上述代码示例从每个块 128 个线程切换到每个块 64 个线程,代码将 运行 明显更快。这是由于根据共享内存占用限制器,这允许增加理论占用率和实现占用率。
正如 max9111 在评论中所指出的,存在更好的算法。这是上面 python 代码和基于树的算法(借用答案 here)之间的比较(在非常慢的 GPU 上):
$ cat t27.py
import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32
from scipy import spatial
import time
#Create some coordinates and indices
#It is assumed that the coordinates are unique (only one entry per hydrant)
ncities = 200064
Coords=np.random.rand(ncities*2).reshape(ncities,2)
Coords*=100
Indices=np.arange(ncities) #Indices
nnear = 33
def get_indices_of_nearest_neighbours(Coords,Indices):
tree=spatial.cKDTree(Coords)
#k=4 because the first entry is the nearest neighbour
# of a point with itself
res=tree.query(Coords, k=nnear)[1][:,1:]
return Indices[res]
start = time.time()
a = get_indices_of_nearest_neighbours(Coords, Indices)
end = time.time()
print (a.shape[0],a.shape[1])
print
print(end -start)
# number of cities, should be divisible by threadblock size
N = ncities
# number of "closest" cities
TN = nnear - 1
#number of threads per block
threadsperblock = 64
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
idx = cuda.threadIdx.x
mymax = 0
maxd = dist[idx]
for i in range(TN-1):
idx += threadsperblock
mynext = dist[idx]
if maxd < mynext:
mymax = i+1
maxd = mynext
return mymax
@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
dist = cuda.shared.array(smSize, float32)
city = cuda.shared.array(smSize, int32)
idx = cuda.grid(1)
tid = cuda.threadIdx.x
wl = tid & 31
my_city_x = citiesx[idx];
my_city_y = citiesy[idx];
last_max = 99999.0
for i in range(TN):
dist[tid+i*threadsperblock] = 99999.0
for i in xrange(0,nc,32):
city_id = i+wl
next_city_x = citiesx[city_id]
next_city_y = citiesy[city_id]
for j in range(32):
if j > 0:
src_lane = (wl+1)&31
city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
if idx != city_id:
my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
if my_dist < last_max:
mymax = find_max(dist)
last_max = dist[mymax*threadsperblock+tid]
if my_dist < last_max:
dist[mymax*threadsperblock+tid] = my_dist
city[mymax*threadsperblock+tid] = city_id
for i in range(TN):
mymax = find_max(dist)
td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
ti[idx+i*nc] = city[mymax*threadsperblock+tid]
dist[mymax*threadsperblock+tid] = 0
#peform test
cx = np.zeros(N).astype(np.float32)
cy = np.zeros(N).astype(np.float32)
for i in range(N):
cx[i] = Coords[i,0]
cy[i] = Coords[i,1]
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
start = time.time()
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
end = time.time()
print(a)
print
print(end - start)
print(np.flip(np.transpose(ti.reshape(nnear-1,ncities)), 1))
$ python t27.py
(200064, 32)
1.32144594193
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
44.1796119213
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
$
在这个非常慢的 Quadro K2000 GPU 上,CPU/scipy-based kd-tree 算法比这个 GPU 实现快得多。然而,在 ~1.3s 时,scipy 实现仍然比 Tesla P100 上的蛮力 CUDA C++ 代码 运行ning 慢一点,而且现在有更快的 GPU。 here. As pointed out there, the brute force approach is easily parallelizable, whereas the kd-tree approach may be non-trivial to parallelize. In any event, a faster solution yet might be a GPU-based kd-tree implementation.
给出了这种差异的可能解释
我有 M~200k 个点,城市的 X、Y 坐标打包在一个 Mx2 numpy 数组中。 目的是,对于每个城市,计算前 N 个最近的城市和 return 它们的索引和到 MxN numpy 矩阵中该城市的距离。 Numba 很好地加速了我在 CPU 上的串行 python 代码,并使用 prange 生成器使其成为多线程,这样在 16 核 SSE 4.2/AVX 机器上调用计算 30 个最近的城市在 6 分 26 秒内完成,同时使所有核心饱和:
@numba.njit(fastmath=True)#
def get_closest(n,N_CLOSEST,coords,i):
dist=np.empty(n,np.float32)
for j in range(n):
dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
indices=np.argsort(dist)[1:N_CLOSEST+1]
return indices,dist[indices]
@numba.njit(fastmath=True,parallel=True)
def get_top_T_closest_cities(coords,T):
n=len(coords)
N_CLOSEST=min(T,n-1)
closest_distances=np.empty((n,N_CLOSEST),np.float32)
closest_cities=np.empty((n,N_CLOSEST),np.int32)
for i in prange(n):
closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,coords,i)
return closest_cities,closest_distances
closest_cities,closest_distances=get_top_T_closest_cities(data,30)
注:我想用 mrange=np.arange(N_CLOSEST) 和 指数=np.argpartition(距离,范围) 后来为了节省一些周期,但不幸的是 Numba 还不支持 np.argpartition。
然后我决定好好利用我新买的 RTX 2070,并尝试将这些非常并行的自然计算卸载到 GPU,再次使用 Numba 和可能的 CuPy。
经过一番思考,我想出了一个相对愚蠢的重写,其中一次处理一个城市的 GPU 内核为 M 个城市中的每一个城市连续调用。在那个内核中,每个 GPU 线程都在计算它的距离并保存到 dist 数组中的特定位置。所有数组都分配在设备上以最小化 PCI 数据传输:
import cupy as cp
def get_top_T_closest_cities_gpu(coords,T):
n=len(coords)
N_CLOSEST=min(T,n-1)
closest_distances=cp.empty((n,N_CLOSEST),cp.float32)
closest_cities=cp.empty((n,N_CLOSEST),cp.int32)
device_coords=cp.asarray(coords)
dist=cp.ndarray(n,cp.float32)
for i in range(n):
if (i % 1000)==0:
print(i)
closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,device_coords,i,dist)
return cp.asnumpy(closest_cities),cp.asnumpy(closest_distances)
@cuda.jit()
def get_distances(coords,i,n,dist):
stride = cuda.gridsize(1)
start = cuda.grid(1)
for j in range(start, n, stride):
dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
def get_closest(n,N_CLOSEST,coords,i,dist):
get_distances[512,512](coords,i,n,dist)
indices=cp.argsort(dist)[1:N_CLOSEST+1]
return indices,dist[indices]
现在,GPU 上的计算结果花费了几乎相同的 6 分钟时间,但 GPU 负载仅为 1%(是的,我确保 return 由 CPU 编辑的结果和 GPU 版本相同).我玩了一下块大小,没有看到任何明显的变化。有趣的是,cp.argsort 和 get_distances 消耗的处理时间份额大致相同。
我觉得这与流有关,如何正确地初始化很多流?这将是重用我的代码的最直接方法,一次处理的不是 1 个城市,而是 16 个或我的计算能力允许的任何城市,理想情况下可能是 1000 个。
你们这些在 Numba/CuPy 中有 GPU 编码经验的人会建议在我的案例中充分利用 GPU 的能力吗?
来自纯 C++ CUDA 辩护者的建议也非常受欢迎,因为我还没有看到原生 Cuda 解决方案与 Numba/CuPy CUDA 解决方案的比较。
Python 版本:['3.6.3 |Anaconda, Inc'] 平台:AMD64 系统:Windows-10-10.0.14393-SP0 Numba 版本:0.41.0
您似乎对基于 CUDA C++ 的答案感兴趣,所以我提供一个。将这个内核转换为等效的 numba cuda jit 内核应该很简单;翻译过程相当机械。
我选择的方法大纲如下:
- 为每个线程分配一个城市(该线程的参考城市)
- 每个线程遍历所有城市计算到其参考城市的成对距离
- 在计算每个距离时,会检查它是否在当前 "closest cities" 范围内。这里的方法是为每个线程保留一个 运行ning 列表。当每个线程计算一个新距离时,它会检查该距离是否小于 运行ning 列表中的当前最大距离。如果是,则当前最大值 distance/city 将替换为刚刚计算的最大值。
- "closest cities" 的列表及其距离保存在共享内存中。
- 所有距离计算完成后,每个线程然后排序并将其共享缓冲区中的值写入全局内存
- 这个基本计划有几个 "optimizations"。一种是每个warp一次读取32个连续的值,然后使用shuffle操作在线程之间传递这些值,以减少全局内存流量。
单个内核调用执行所有城市的整个操作。
此内核的性能会因您使用的 GPU 而有很大差异 运行。例如,在特斯拉 P100 上,它 运行s 大约需要 0.5s。在 Tesla K40 上大约需要 6 秒。在 Quadro K2000 上大约需要 40 秒。
这是 Tesla P100 的完整测试用例:
$ cat t364.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <thrust/sort.h>
const int ncities = 200064; // should be divisible by nTPB
const int nclosest = 32; // maximum 48
const int nTPB = 128; // should be multiple of 32
const float FLOAT_MAX = 999999.0f;
__device__ void init_shared(float *d, int n){
int idx = threadIdx.x;
for (int i = 0; i < n; i++){
d[idx] = FLOAT_MAX;
idx += nTPB;}
}
__device__ int find_max(float *d, int n){
int idx = threadIdx.x;
int max = 0;
float maxd = d[idx];
for (int i = 1; i < n; i++){
idx += nTPB;
float next = d[idx];
if (maxd < next){
max = i;
maxd = next;}}
return max;
}
__device__ float compute_sqdist(float2 c1, float2 c2){
return (c1.x - c2.x)*(c1.x - c2.x) + (c1.y - c2.y)*(c1.y - c2.y);
}
__device__ void sort_and_store_sqrt(int idx, float *closest_dist, int *closest_id, float *dist, int *city, int n, int n_cities)
{
for (int i = n-1; i >= 0; i--){
int max = find_max(dist, n);
closest_dist[idx + i*n_cities] = sqrtf(dist[max*nTPB+threadIdx.x]);
closest_id[idx + i*n_cities] = city[max*nTPB+threadIdx.x];
dist[max*nTPB+threadIdx.x] = 0.0f;}
};
__device__ void shuffle(int &city_id, float2 &next_city_xy, int my_warp_lane){
int src_lane = (my_warp_lane+1)&31;
city_id = __shfl_sync(0xFFFFFFFF, city_id, src_lane);
next_city_xy.x = __shfl_sync(0xFFFFFFFF, next_city_xy.x, src_lane);
next_city_xy.y = __shfl_sync(0xFFFFFFFF, next_city_xy.y, src_lane);
}
__global__ void k(const float2 * __restrict__ cities, float * __restrict__ closest_dist, int * __restrict__ closest_id, const int n_cities){
__shared__ float dist[nclosest*nTPB];
__shared__ int city[nclosest*nTPB];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int my_warp_lane = (threadIdx.x & 31);
init_shared(dist, nclosest);
float2 my_city_xy = cities[idx];
float last_max = FLOAT_MAX;
for (int i = 0; i < n_cities; i+=32){
int city_id = i+my_warp_lane;
float2 next_city_xy = cities[city_id];
for (int j = 0; j < 32; j++){
if (j > 0) shuffle(city_id, next_city_xy, my_warp_lane);
if (idx != city_id){
float my_dist = compute_sqdist(my_city_xy, next_city_xy);
if (my_dist < last_max){
int max = find_max(dist, nclosest);
last_max = dist[max*nTPB+threadIdx.x];
if (my_dist < last_max) {
dist[max*nTPB+threadIdx.x] = my_dist;
city[max*nTPB+threadIdx.x] = city_id;}}}}}
sort_and_store_sqrt(idx, closest_dist, closest_id, dist, city, nclosest, n_cities);
}
void on_cpu(float2 *cities, float *dists, int *ids){
thrust::host_vector<int> my_ids(ncities);
thrust::host_vector<float> my_dists(ncities);
for (int i = 0; i < ncities; i++){
for (int j = 0; j < ncities; j++){
my_ids[j] = j;
if (i != j) my_dists[j] = sqrtf((cities[i].x - cities[j].x)*(cities[i].x - cities[j].x) + (cities[i].y - cities[j].y)*(cities[i].y - cities[j].y));
else my_dists[j] = FLOAT_MAX;}
thrust::sort_by_key(my_dists.begin(), my_dists.end(), my_ids.begin());
for (int j = 0; j < nclosest; j++){
dists[i+j*ncities] = my_dists[j];
ids[i+j*ncities] = my_ids[j];}
}
}
int main(){
float2 *h_cities, *d_cities;
float *h_closest_dist, *d_closest_dist, *h_closest_dist_cpu;
int *h_closest_id, *d_closest_id, *h_closest_id_cpu;
cudaMalloc(&d_cities, ncities*sizeof(float2));
cudaMalloc(&d_closest_dist, nclosest*ncities*sizeof(float));
cudaMalloc(&d_closest_id, nclosest*ncities*sizeof(int));
h_cities = new float2[ncities];
h_closest_dist = new float[ncities*nclosest];
h_closest_id = new int[ncities*nclosest];
h_closest_dist_cpu = new float[ncities*nclosest];
h_closest_id_cpu = new int[ncities*nclosest];
for (int i = 0; i < ncities; i++){
h_cities[i].x = (rand()/(float)RAND_MAX)*10.0f;
h_cities[i].y = (rand()/(float)RAND_MAX)*10.0f;}
cudaMemcpy(d_cities, h_cities, ncities*sizeof(float2), cudaMemcpyHostToDevice);
k<<<ncities/nTPB, nTPB>>>(d_cities, d_closest_dist, d_closest_id, ncities);
cudaMemcpy(h_closest_dist, d_closest_dist, ncities*nclosest*sizeof(float),cudaMemcpyDeviceToHost);
cudaMemcpy(h_closest_id, d_closest_id, ncities*nclosest*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < nclosest; i++){
for (int j = 0; j < 5; j++) std::cout << h_closest_id[j+i*ncities] << "," << h_closest_dist[j+i*ncities] << " ";
std::cout << std::endl;}
if (ncities < 5000){
on_cpu(h_cities, h_closest_dist_cpu, h_closest_id_cpu);
for (int i = 0; i < ncities*nclosest; i++)
if (h_closest_id_cpu[i] != h_closest_id[i]) {std::cout << "mismatch at: " << i << " was: " << h_closest_id[i] << " should be: " << h_closest_id_cpu[i] << std::endl; return -1;}
}
return 0;
}
$ nvcc -o t364 t364.cu
$ nvprof ./t364
==31871== NVPROF is profiling process 31871, command: ./t364
33581,0.00466936 116487,0.0163371 121419,0.0119542 138585,0.00741395 187892,0.0205568
56138,0.0106946 105637,0.0195111 175565,0.0132018 121719,0.0090809 198333,0.0269794
36458,0.014851 6942,0.0244329 67920,0.013367 140919,0.014739 91533,0.0348142
48152,0.0221216 146867,0.0250192 14656,0.020469 163149,0.0247639 162442,0.0354359
3681,0.0226946 127671,0.0265515 32841,0.0219539 109520,0.0359874 21346,0.0424094
20903,0.0313963 3075,0.0279635 79787,0.0220388 106161,0.0365807 24186,0.0453916
191226,0.0316818 4168,0.0297535 126726,0.0246147 154598,0.0374218 62106,0.0459001
185573,0.0349628 76270,0.030849 122878,0.0249695 124897,0.0447656 38124,0.0463704
71252,0.037517 18879,0.0350544 112410,0.0299296 102006,0.0462593 12361,0.0464608
172641,0.0399721 134288,0.035077 39252,0.031506 164570,0.0470057 81105,0.0502873
18960,0.0433545 53266,0.0360357 195467,0.0334281 36715,0.0470069 153375,0.0508115
163568,0.0442928 95544,0.0361058 151839,0.0398384 114041,0.0490263 8524,0.0511531
182326,0.047519 59919,0.0362906 90810,0.0408069 52013,0.0494515 16793,0.0525569
169860,0.048417 77412,0.036694 12065,0.0414496 137863,0.0494703 197500,0.0537481
40158,0.0492621 34592,0.0377113 54812,0.041594 58792,0.049532 70501,0.0541114
121444,0.0501154 102800,0.0414865 96238,0.0433548 34323,0.0531493 161527,0.0551868
118564,0.0509228 82832,0.0449587 167965,0.0439488 104475,0.0534779 66968,0.0572788
60136,0.0528873 88318,0.0455667 118562,0.0462537 129853,0.0535594 7544,0.0588875
95975,0.0587857 65792,0.0479467 124929,0.046828 116360,0.0537344 37341,0.0594454
62542,0.0592229 57399,0.0509665 186583,0.0470843 47571,0.0541045 81275,0.0596965
11499,0.0607943 28314,0.0512354 23730,0.0518801 176089,0.0543222 562,0.06527
131594,0.0611795 23120,0.0515408 25933,0.0547776 117474,0.0557752 194499,0.0657885
23959,0.0623019 137283,0.0533193 173000,0.0577509 157537,0.0566689 193091,0.0666895
5660,0.0629772 15498,0.0555821 161025,0.0596721 123148,0.0589929 147928,0.0672529
51033,0.063036 15456,0.0565314 145859,0.0601408 3012,0.0601779 107646,0.0687241
26790,0.0643055 99659,0.0576798 33153,0.0603556 48388,0.0617377 47566,0.0689055
178826,0.0655352 143209,0.058413 101960,0.0604994 22146,0.0620504 91118,0.0705487
32108,0.0672866 172089,0.058676 17885,0.0638383 137318,0.0624543 138223,0.0716578
38125,0.0678566 42847,0.0609811 10879,0.0681518 154360,0.0633921 96195,0.0723226
96583,0.0683073 169703,0.0621889 100839,0.0721133 28595,0.0661302 20422,0.0731882
98329,0.0683718 50432,0.0636618 84204,0.0733909 181919,0.066552 75375,0.0736715
75814,0.0694582 161714,0.0674298 89049,0.0749184 151275,0.0679411 37849,0.0739173
==31871== Profiling application: ./t364
==31871== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 95.66% 459.88ms 1 459.88ms 459.88ms 459.88ms k(float2 const *, float*, int*, int)
4.30% 20.681ms 2 10.340ms 10.255ms 10.425ms [CUDA memcpy DtoH]
0.04% 195.55us 1 195.55us 195.55us 195.55us [CUDA memcpy HtoD]
API calls: 58.18% 482.42ms 3 160.81ms 472.46us 470.80ms cudaMemcpy
41.05% 340.38ms 3 113.46ms 322.83us 339.73ms cudaMalloc
0.55% 4.5490ms 384 11.846us 339ns 503.79us cuDeviceGetAttribute
0.16% 1.3131ms 4 328.27us 208.85us 513.66us cuDeviceTotalMem
0.05% 407.13us 4 101.78us 93.796us 118.27us cuDeviceGetName
0.01% 61.719us 1 61.719us 61.719us 61.719us cudaLaunchKernel
0.00% 23.965us 4 5.9910us 3.9830us 8.9510us cuDeviceGetPCIBusId
0.00% 6.8440us 8 855ns 390ns 1.8150us cuDeviceGet
0.00% 3.7490us 3 1.2490us 339ns 2.1300us cuDeviceGetCount
0.00% 2.0260us 4 506ns 397ns 735ns cuDeviceGetUuid
$
CUDA 10.0、Tesla P100、CentOS 7
该代码包括验证的可能性,但它根据基于 CPU 的原始代码验证结果,这需要更长的时间。所以我将验证限制在较小的测试用例中,例如。 4096个城市。
这是使用 numba cuda 对上述代码的 "translated" 近似。它似乎 运行 比 CUDA C++ 版本慢大约 2 倍,但事实并非如此。 (促成因素之一似乎是 sqrt
函数的使用——它对 numba 代码有显着的性能影响,但对 CUDA C++ 代码没有性能影响。它可能使用双精度实现在 numba CUDA 的引擎盖下。)无论如何,它提供了如何在 numba 中实现它的参考:
import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32
# number of cities, should be divisible by threadblock size
N = 200064
# number of "closest" cities
TN = 32
#number of threads per block
threadsperblock = 128
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
idx = cuda.threadIdx.x
mymax = 0
maxd = dist[idx]
for i in range(TN-1):
idx += threadsperblock
mynext = dist[idx]
if maxd < mynext:
mymax = i+1
maxd = mynext
return mymax
@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
dist = cuda.shared.array(smSize, float32)
city = cuda.shared.array(smSize, int32)
idx = cuda.grid(1)
tid = cuda.threadIdx.x
wl = tid & 31
my_city_x = citiesx[idx];
my_city_y = citiesy[idx];
last_max = 99999.0
for i in range(TN):
dist[tid+i*threadsperblock] = 99999.0
for i in xrange(0,nc,32):
city_id = i+wl
next_city_x = citiesx[city_id]
next_city_y = citiesy[city_id]
for j in range(32):
if j > 0:
src_lane = (wl+1)&31
city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
if idx != city_id:
my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
if my_dist < last_max:
mymax = find_max(dist)
last_max = dist[mymax*threadsperblock+tid]
if my_dist < last_max:
dist[mymax*threadsperblock+tid] = my_dist
city[mymax*threadsperblock+tid] = city_id
for i in range(TN):
mymax = find_max(dist)
td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
ti[idx+i*nc] = city[mymax*threadsperblock+tid]
dist[mymax*threadsperblock+tid] = 0
#peform test
cx = np.random.rand(N).astype(np.float32)
cy = np.random.rand(N).astype(np.float32)
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
for i in range(TN):
for j in range(1):
print(ti[i*N+j])
print(td[i*N+j])
我没有在这个 numba cuda 变体中包含验证,因此它可能包含缺陷。
编辑:通过将上述代码示例从每个块 128 个线程切换到每个块 64 个线程,代码将 运行 明显更快。这是由于根据共享内存占用限制器,这允许增加理论占用率和实现占用率。
正如 max9111 在评论中所指出的,存在更好的算法。这是上面 python 代码和基于树的算法(借用答案 here)之间的比较(在非常慢的 GPU 上):
$ cat t27.py
import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32
from scipy import spatial
import time
#Create some coordinates and indices
#It is assumed that the coordinates are unique (only one entry per hydrant)
ncities = 200064
Coords=np.random.rand(ncities*2).reshape(ncities,2)
Coords*=100
Indices=np.arange(ncities) #Indices
nnear = 33
def get_indices_of_nearest_neighbours(Coords,Indices):
tree=spatial.cKDTree(Coords)
#k=4 because the first entry is the nearest neighbour
# of a point with itself
res=tree.query(Coords, k=nnear)[1][:,1:]
return Indices[res]
start = time.time()
a = get_indices_of_nearest_neighbours(Coords, Indices)
end = time.time()
print (a.shape[0],a.shape[1])
print
print(end -start)
# number of cities, should be divisible by threadblock size
N = ncities
# number of "closest" cities
TN = nnear - 1
#number of threads per block
threadsperblock = 64
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
idx = cuda.threadIdx.x
mymax = 0
maxd = dist[idx]
for i in range(TN-1):
idx += threadsperblock
mynext = dist[idx]
if maxd < mynext:
mymax = i+1
maxd = mynext
return mymax
@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
dist = cuda.shared.array(smSize, float32)
city = cuda.shared.array(smSize, int32)
idx = cuda.grid(1)
tid = cuda.threadIdx.x
wl = tid & 31
my_city_x = citiesx[idx];
my_city_y = citiesy[idx];
last_max = 99999.0
for i in range(TN):
dist[tid+i*threadsperblock] = 99999.0
for i in xrange(0,nc,32):
city_id = i+wl
next_city_x = citiesx[city_id]
next_city_y = citiesy[city_id]
for j in range(32):
if j > 0:
src_lane = (wl+1)&31
city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
if idx != city_id:
my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
if my_dist < last_max:
mymax = find_max(dist)
last_max = dist[mymax*threadsperblock+tid]
if my_dist < last_max:
dist[mymax*threadsperblock+tid] = my_dist
city[mymax*threadsperblock+tid] = city_id
for i in range(TN):
mymax = find_max(dist)
td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
ti[idx+i*nc] = city[mymax*threadsperblock+tid]
dist[mymax*threadsperblock+tid] = 0
#peform test
cx = np.zeros(N).astype(np.float32)
cy = np.zeros(N).astype(np.float32)
for i in range(N):
cx[i] = Coords[i,0]
cy[i] = Coords[i,1]
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
start = time.time()
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
end = time.time()
print(a)
print
print(end - start)
print(np.flip(np.transpose(ti.reshape(nnear-1,ncities)), 1))
$ python t27.py
(200064, 32)
1.32144594193
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
44.1796119213
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
$
在这个非常慢的 Quadro K2000 GPU 上,CPU/scipy-based kd-tree 算法比这个 GPU 实现快得多。然而,在 ~1.3s 时,scipy 实现仍然比 Tesla P100 上的蛮力 CUDA C++ 代码 运行ning 慢一点,而且现在有更快的 GPU。 here. As pointed out there, the brute force approach is easily parallelizable, whereas the kd-tree approach may be non-trivial to parallelize. In any event, a faster solution yet might be a GPU-based kd-tree implementation.
给出了这种差异的可能解释