Parallelize loops using OpenCL in Python

Question

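I have a given dataset in the matrix y and I want to train different SOMs with it. Each SOM is one-dimensional (a line), and the number of neurons varies. I train a SOM of size N=2 first and one of size N=NMax last, giving NMax-2+1 SOMs in total. For each SOM, I want to store the weights once training is over, before moving on to the next SOM.

The whole point of using PyOpenCL here is that each iteration of the outer loop is independent of the others. That is, for each value of N, the script does not care what happens when N takes other values. Running the script NMax-2+1 times, changing the value of N by hand each time, would give the same result.

With this in mind, I would like to run each of these independent iterations at the same time on the GPU, so that the time spent drops significantly. The run time will not drop all the way to 1/(NMax-2+1) of the original, though, because each iteration is more expensive than the previous one: the larger N is, the more computation is needed.

Is there a way to 'translate' this code so that it runs on the GPU? I have never used OpenCL before, so let me know if this is too broad or stupid, so that I can ask a more specific question. The code is self-contained, so feel free to try it out. The four constants declared at the beginning can be changed to whatever you like (assuming NMax > 1 and all the others strictly positive).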

import numpy as np
import time

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 3 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

t_begin = time.perf_counter() # Start time (time.clock() was removed in Python 3.8)
for N in range(NMax-1): # Number of neurons for this iteration
    w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - np.arange(N+2)) # Index distance of every neuron to the winner
            dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0) # Wrap-around distance
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
        print(N+2,iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk] - w[N])**2,axis=-1)
        ii = np.argmin(aux) # Neuron 'ii' is the winner
        wClusters[kk,N] = ii + 1
t_end = time.perf_counter() # End time
#%%
print(t_end - t_begin)

Answer 1

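I will try to give a reasonably complete answer.

First of all:

Can this code be adapted to run on the GPU using (Py)OpenCL?

Most probably yes.

Can this be done automatically?

No (as far as I know).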

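Most questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speedup gain?" You say that your outer loop is independent of the results of the other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL work-item would execute the same code with slightly different input parameters. Disregarding the overhead of data transfer between host and device, the running time of this approach would equal the running time of the slowest iteration. Depending on the number of iterations in your outer loop, this could be a huge speed gain. As long as the numbers stay relatively small, you could try Python's multiprocessing module to parallelize these iterations on the CPU instead of the GPU.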
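To put a rough number on "the running time of the slowest iteration", here is a minimal sketch. It assumes, purely for illustration, that the cost of training one SOM grows linearly with its neuron count; `ideal_speedup` and its `cost` parameter are hypothetical names, and transfer or scheduling overhead is ignored.

```python
def ideal_speedup(n_max, cost=lambda n: n):
    """Serial time divided by the time of the slowest run, for N = 2..n_max.

    `cost` is a hypothetical per-run cost model. The default assumes that
    training one SOM costs an amount proportional to its neuron count.
    """
    costs = [cost(n) for n in range(2, n_max + 1)]
    # Serial: all runs one after another. Parallel: wait for the slowest run.
    return sum(costs) / max(costs)


for n_max in (3, 10, 50):
    runs = n_max - 1
    print(n_max, "-> runs:", runs, "ideal speedup:", round(ideal_speedup(n_max), 2))
```

Under this assumed cost model, NMax = 10 gives nine independent runs but an ideal speedup of only 5.4, because the largest SOM dominates the parallel run time.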

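Porting to the GPU usually only makes sense if a large number of processes is to be run in parallel (about 1000 or more). So in your case, if you really want a huge speed boost, see whether you can parallelize all the calculations inside the loop. For example, you have 150 iterations and 2000 data points. If you could somehow parallelize over those 2000 data points, that could offer a much bigger speed gain, which could justify the work of porting the whole code to OpenCL.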
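As an illustration of what "parallelizing over the data points" can look like, the sketch below covers only the final assignment step of the question's code, where every data point is handled independently. The weight updates inside the training loop depend on one another and are not covered. It uses NumPy broadcasting rather than OpenCL, so treat it as a picture of the data-parallel formulation (one data point per work-item), not as a GPU implementation. The name `assign_clusters` and the sample weights are assumptions made for the example.

```python
import numpy as np


def assign_clusters(y, weights):
    """Return the 1-based index of the nearest neuron for every row of y."""
    # Broadcasting builds an array of shape (num_points, n_neurons, m).
    diff = y[:, np.newaxis, :] - weights[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)      # shape (num_points, n_neurons)
    return np.argmin(sq_dist, axis=1) + 1     # +1 matches wClusters in the question


rng = np.random.default_rng(0)
y = rng.random((2000, 3))            # same shape as the dataset in the question
weights = rng.random((5, 3)) - 0.5   # hypothetical weights of a 5-neuron SOM
clusters = assign_clusters(y, weights)
print(clusters[:10])
```

Each row of `sq_dist` is computed without reference to any other row, which is the property a GPU kernel would exploit.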

TL;DR: Try parallelizing on the CPU first. If you find that you need to run more than several hundred processes at the same time, move to the GPU.

Update: simple code for parallelizing on the CPU using multiprocessing (without callbacks)

import numpy as np
import time
import multiprocessing as mp

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

def neuron_run(N):
    # Train one SOM with N + 2 neurons. The results are returned because
    # writes to globals made inside a worker process never reach the parent.
    n_neurons = N + 2
    weights = np.random.uniform(0,1,(n_neurons,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - weights)**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - np.arange(n_neurons))
            dists = np.min(np.vstack([jjs , abs(jjs-n_neurons)]), axis = 0)
            # Update weights
            weights = weights + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - weights)
        print(N+2,iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    clusters = np.zeros(np.size(y,axis = 0))
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk] - weights)**2,axis=-1)
        clusters[kk] = np.argmin(aux) + 1 # Winner index, 1-based
    return N, weights, clusters

#%%

def apply_async():
    pool = mp.Pool(processes=NMax - 1) # One process per SOM
    results = [pool.apply_async(neuron_run, args = (N,)) for N in range(NMax-1)]
    pool.close()
    pool.join()
    # Collect the results in the parent process
    for res in results:
        N, weights, clusters = res.get()
        w[N] = weights
        wClusters[:,N] = clusters
    print("Multiprocessing done!")

if __name__ == '__main__':
    t_begin = time.perf_counter() # Start time (time.clock() was removed in Python 3.8)
    apply_async()
    t_end = time.perf_counter() # End time
    print(t_end - t_begin)
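One detail to keep in mind with the multiprocessing version: the workers draw from NumPy's global random generator, which is seeded once at module level, so processes can start from the same random state. If independent but reproducible random streams per SOM matter, one possible approach (an assumption on my part, not something the code above does) is to derive one generator per worker from a root seed with `np.random.SeedSequence`. The helper name `make_worker_rngs` is hypothetical.

```python
import numpy as np


def make_worker_rngs(n_workers, root_seed=0):
    """Create one independent, reproducible generator per worker."""
    # spawn() derives statistically independent child seeds from the root seed.
    child_seeds = np.random.SeedSequence(root_seed).spawn(n_workers)
    return [np.random.default_rng(seed) for seed in child_seeds]


rngs = make_worker_rngs(3)
# Each worker would shuffle the inputs with its own generator, for example:
perms = [rng.permutation(10) for rng in rngs]
for worker_id, perm in enumerate(perms):
    print(worker_id, perm)
```

In practice each generator (or its child seed) would be passed to the worker as an extra argument and used in place of the `np.random.*` calls.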

