如何将输入矩阵拆分为子矩阵以进行并行 GPU 处理？

Question

我是 Whosebug 的新手，很高兴 post 我的第一个问题。我实际上是编程领域的新手，但无论如何我都有一些基本知识，我实际上是在使用 Python 来解决分类问题。我有一大组数据（在我的实际问题中 n=60326），每个向量大约有 416 维。我需要计算这些不同向量之间的曼哈顿距离，以便根据相似性对我的数据集进行分类（采用随机参考向量并将最接近的向量合并到它，距离范围在 0 和 1 之间）我已经熟悉 Kmeans 和基本的 ML 聚类算法...我的实际问题是我需要使用 GPU (CUDA) 加速时间，以便在开始分类之前初步计算距离矩阵，其大小为 n²（60326 x 60326，我们可以将其减少到 n²/2因为它是一个对称矩阵）所以我的实际问题是在这种情况下如何实现 CUDA，我已经用 ANACONDA 安装了 CUDA 包。

我从并行 CPU 的处理开始，它提供了内存错误

import pandas as pd
import time  
import time
import numpy as np
import random
from scipy.spatial import distance
from sklearn.metrics.pairwise import pairwise_distances
signatures=pd.read_csv("C:\Users\YMH1\Documents\Reduce Java code\BOBST.txt", sep=' ',header=None,usecols=[*range(2,417)])
PartName = pd.read_csv('C:\Users\YMH1\Documents\Reduce Java code\misumi_new.txt', sep=' ', header=None, usecols=[*range(0,1)])
signatures_vector=np.array(signatures)
PartName_vector=np.array(PartName)
D = pairwise_distances(X = signatures_vector, metric = 'manhattan', n_jobs = -1)
print(D)

现在，我正在尝试实施 CUDA，因为它可以加快时间，所以我编写了以下代码：

from __future__ import division
from numba import cuda, float32
import pandas as pd
import math
signatures=pd.read_csv("C:\Users\YMH1\Documents\Reduce Java code\BOBST.txt", sep=';',header=None,usecols=[*range(2,418)])
signatures_vector=np.array(signatures)
@cuda.jit
def manhattan(an_array):
  x, y = cuda.grid(2)
  "Here we define the Manhattan distance"
  return an_array
data=signatures_vector
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(data.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(data.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
manhattan[blockspergrid, threadsperblock](data)
print(data)

我的问题是在使用CUDA的时候怎么定义manhattan norm，def manhattan里面要修改什么非常感谢你

Answer 1

My question is how to define the manhattan norm when we use CUDA, what are the modifications to be made inside the def manhattan

根据documentation，manhattan 距离度量是元素差异绝对值的两个向量之间的总和。

您可能运行遇到的一个问题是内存问题 space。如果我们假设距离度量（即矩阵元素）的输出表示为一个普通的 python 量，这可能会占用 8 个字节的内存。对于规定的维度 (60326)，这意味着矩阵将占用 60326*60326*8 字节，几乎是 30GB。即使我们假设您只存储其中的一半，即使我们假设绝对差的总和为 32 位，这仍然需要超过 7GB 的存储空间。

当我尝试运行使用 sckit-learn 方法进行这样的测试时，我遇到了麻烦，即使是在具有 128GB 系统内存的机器上也是如此：

# cat t5.py
import numpy as np
import random
from scipy.spatial import distance
from sklearn.metrics.pairwise import pairwise_distances

vector_length = 416
num_signatures = 60000
sig_pattern = np.array([0,1,2,3], dtype=np.float32).reshape(4,1)
signatures = np.tile(sig_pattern,(num_signatures, vector_length//sig_pattern.shape[1]))
E = pairwise_distances(signatures, metric = 'manhattan', n_jobs = -1)
print(E[:8,:8])
# time python t5.py

Traceback (most recent call last):
  File "t5.py", line 17, in <module>
    E = pairwise_distances(signatures, metric = 'manhattan', n_jobs = -1)
  File "/root/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1247, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/root/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1096, in _parallel_pairwise
    for s in gen_even_slices(Y.shape[0], n_jobs))
  File "/root/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/root/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/root/anaconda2/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result:
'[array([[ 0.,  416.,  832., ...,  416.,  832., 1248.],
       [ 416.,    0.,  416., ...,    0.,  416.,  832.],
       [ 832.,  416.,    0., ...,  416.,    0.,  416.],
       ...,
       [ 416.,    0.,  416., ...,    0.,  416.,  832.],
       [ 832.,  416.,    0., ...,  416.,    0.,  416.],
       [1248.,  832.,  416., ...,  832.,  416.,    0.]])]'. Reason: 'OverflowError('cannot serialize a string larger than 2 GiB',)'

real    31m47.361s
user    405m28.155s
sys     8m19.851s

在那种情况下，在我的机器上计算似乎需要大约 30 分钟。输出测试矩阵看起来大致正确，但 python 抛出错误，因为某些中间表示 >2GB。

内存大小问题也将是 numba/cuda 实现中需要考虑的重要问题之一。

不过，要执行的操作相对简单。根据我的测试，它运行比 numpy/scikit-learn 方法快得多。

这是一个有效的例子：

# cat t4.py
import numpy as np
import numba as nb
from numba import cuda,float32

vector_length = 416
num_vecs_per_block = 8
sm_size = vector_length*num_vecs_per_block
#num_sig must be divisible by num_vecs_per_block
@cuda.jit('void(float32[:,:], float32[:,:], int32, int32)')
def manhattan(signatures, distances, num_sig, vec_len):
    sm = cuda.shared.array(sm_size, float32)
    temp = cuda.local.array(num_vecs_per_block, dtype = float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
# load shared memory with vectors for comparison
    if tid < num_vecs_per_block:
        for i in range(vec_len):
            sm[i*num_vecs_per_block+tid] = signatures[i, bid*num_vecs_per_block+tid];
    cuda.syncthreads()
#block-stride loop through the vector array
# the addition below to tid results in only elements above the diagonal being computed
# since the output matrix is symmetric
    svec = tid +(bid*num_vecs_per_block)
    while svec < num_sig:
        for i in range(num_vecs_per_block):
            temp[i] = 0
        for i in range(vec_len):
            src = signatures[i,svec]
            for j in range(num_vecs_per_block):
                #elementwise difference
                sdist = src - sm[i*num_vecs_per_block + j]
                #absolute value
                if sdist < 0:
                    sdist *= -1
                #sum
                temp[j] += sdist
        for i in range(num_vecs_per_block):
            distances[bid*num_vecs_per_block+i, svec] = temp[i]
        svec += bdim

num_signatures = 60000
sig_pattern = np.array([0,1,2,3], dtype=np.float32)
signatures = np.tile(sig_pattern,(num_signatures//sig_pattern.shape[0], vector_length))
distances  = np.zeros((num_signatures, num_signatures), dtype=np.float32)
threadsperblock = 1024
blockspergrid   = num_signatures//num_vecs_per_block
d_signatures = cuda.to_device(signatures)
d_distances = cuda.device_array_like(distances)
manhattan[blockspergrid, threadsperblock](d_signatures, d_distances, num_signatures, vector_length)
d_distances.copy_to_host(distances)
print(distances[:16,:16])
# time python t4.py
[[   0.  416.  832. 1248.    0.  416.  832. 1248.    0.  416.  832. 1248.    0.  416.  832. 1248.]
 [ 416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.]
 [ 832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.]
 [1248.  832.  416.    0. 1248.  832.  416.    0. 1248.  832.  416.    0. 1248.  832.  416.    0.]
 [   0.  416.  832. 1248.    0.  416.  832. 1248.    0.  416.  832. 1248.    0.  416.  832. 1248.]
 [ 416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.]
 [ 832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.  832.  416.    0.  416.]
 [1248.  832.  416.    0. 1248.  832.  416.    0. 1248.  832.  416.    0. 1248.  832.  416.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.  416.  832. 1248.    0.  416.  832. 1248.]
 [   0.    0.    0.    0.    0.    0.    0.    0.  416.    0.  416.  832.  416.    0.  416.  832.]
 [   0.    0.    0.    0.    0.    0.    0.    0.  832.  416.    0.  416.  832.  416.    0.  416.]
 [   0.    0.    0.    0.    0.    0.    0.    0. 1248.  832.  416.    0. 1248.  832.  416.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.  416.  832. 1248.    0.  416.  832. 1248.]
 [   0.    0.    0.    0.    0.    0.    0.    0.  416.    0.  416.  832.  416.    0.  416.  832.]
 [   0.    0.    0.    0.    0.    0.    0.    0.  832.  416.    0.  416.  832.  416.    0.  416.]
 [   0.    0.    0.    0.    0.    0.    0.    0. 1248.  832.  416.    0. 1248.  832.  416.    0.]]

real    0m11.580s
user    0m5.579s
sys     0m5.922s
#

在这种情况下，我将 signatures 向量的总数限制为 60000，而不是 60320，因为较大的数字导致输出矩阵太大，无法容纳我的 16GB Tesla 的可用内存P100 显卡。 如果您的 GPU 内存小于 16GB，此代码将无法按原样运行。您需要将问题缩小到较少数量的签名向量。把向量分成两组，计算两组之间的距离，填满整个矩阵，应该是比较直接的。

然而，我的测试机器上的 numba 代码运行s 只用了大约 11 秒，并且似乎与我正在打印的非常小的 16x16 输出矩阵块产生了等效的结果。

如果我分析这段代码，我实际上发现 GPU 内核在大约 3 秒内运行ning，从 GPU 到 CPU 的巨大输出矩阵的数据传输时间大约需要6 秒，剩下的是大约 2 秒的 python 开销。

实际的GPU算法是面向块的。每个块负责将 8 个向量与整个向量数组进行比较。每个块首先将 8 个向量加载到共享内存中，然后遍历整个向量数组，针对这 8 个向量中的每一个计算曼哈顿距离。该块使用块步幅循环运行遍历整个数组。在每次循环迭代结束时，将当前比较对应的8个曼哈顿距离写入输出数组。

此外，代码有细微的变化，因此仅计算对角线以上的矩阵输出值。由于计算是按块进行的，因此不计算对角线上没有元素的块。这将处理时间缩短了大约一半，但完整的输出矩阵仍在内存中。为此，我上面16x16输出的左下象限全为零，因为8x8象限完全是"below"对角线，所以跳过对它的处理，由于相似情况已经指出问题。

如何将输入矩阵拆分为子矩阵以进行并行 GPU 处理？

How to split input matrix to sub-matrixes for parallel GPU processing?

python

cuda

numba