How to run a Python script on a discrete AMD GPU?

WHAT I WANT TO DO:

I have a script that prints the prime numbers within a given interval:

# Python program to display all the prime numbers within an interval

lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
    # all prime numbers are greater than 1
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            print(num)

I would like to run such scripts on the GPU instead of the CPU, so that they run faster.

THE PROBLEM:

My Intel NUC NUC8i7HVK does not have an NVIDIA GPU, but a "Discrete GPU".

If I run this code to check what my GPU is:

import pyopencl as cl
import numpy as np

# host-side input array and an empty array for the result
a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

# create an OpenCL context (asks which platform/device to use) and a command queue
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# device buffers: copy `a` to the device and reserve space for the result
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

# build a kernel that squares each element
prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
                     __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

# launch one work item per element, then copy the result back to the host
prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print(a, res)

I get:

THE POSSIBLE APPROACH TO THE PROBLEM:

I found a guide that takes you by the hand and explains step by step how to run it on your GPU. But all the Python libraries that pipe Python through the GPU, like PyOpenGL, PyOpenCL, TensorFlow, PyTorch, etc., are tailor-made for NVIDIA.

If you have an AMD GPU, all of these libraries ask for ROCm, but as far as I know that software still does not support integrated GPUs or discrete GPUs (see my own reply below).

I found only one guide that discusses this approach, but I could not get it to work.

Is there any hope, or am I just trying to do something impossible?

EDIT: Reply to @chapelo

If I choose 0, the reply is:

Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

If I choose 1, the reply is:

Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

After a lot of research and several attempts, I came to the following conclusions:

  1. PyOpenGL: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm
  2. PyOpenCL: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm
  3. TensorFlow: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm
  4. PyTorch: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm

I installed ROCm, but if I run rocminfo it returns:

ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

clinfo returns:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

rocm-smi returns:

Segmentation fault

This is because the official guide says that "the integrated GPUs of Ryzen are not officially supported targets for ROCm." Since mine is an integrated GPU, I am out of scope.

I will not waste any more time on this and will probably buy an NVIDIA or AMD eGPU (external GPU).

pyopencl does work with both your AMD and your Intel GPUs, and you have already checked that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD every time without being asked.
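
For example, a minimal sketch of pinning that choice from inside the script (assuming '0' really is the index of the AMD platform shown by the interactive prompt on your machine; check the prompt's listing first):

import os
import pyopencl as cl

# Assumption: platform index 0 is the AMD one on this machine; adjust the
# index to match the listing that the interactive prompt prints.
os.environ["PYOPENCL_CTX"] = "0"

ctx = cl.create_some_context()  # no longer asks which platform to use
print(ctx.devices)              # confirm which device was actually picked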

Or, instead of using ctx = cl.create_some_context(), you can define the context in your program with:

platforms = cl.get_platforms()
ctx = cl.Context(
    dev_type=cl.device_type.ALL,
    properties=[(cl.context_properties.PLATFORM, platforms[0])])
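
If you would rather pick one specific device instead of a whole platform, a small sketch like the following can help; the indices used here are assumptions, so print the listing first and adjust them to whatever your machine reports:

import pyopencl as cl

# list every platform and the devices it exposes, so you can see where the AMD GPU lives
for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print(p_idx, d_idx, platform.name, device.name)

# Assumption: the AMD GPU shows up as device 0 of platform 0 on this machine.
gpu = cl.get_platforms()[0].get_devices()[0]
ctx = cl.Context(devices=[gpu])
queue = cl.CommandQueue(ctx)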

Do not take for granted that your AMD will beat your Intel in every case. I have had cases where the Intel outperformed the other one. I think this has to do with the cost of copying the data out of the CPU to the other GPU.
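
If you want to check this on your own hardware, a rough sketch is to run the same small squaring kernel once per device and compare the wall-clock time, including the host-to-device and device-to-host copies (the kernel is built outside the timed section so compilation does not skew the comparison):

import time
import numpy as np
import pyopencl as cl

a = np.arange(1_000_000).astype(np.float32)
kernel_src = """
__kernel void sq(__global const float *a, __global float *c) {
  int gid = get_global_id(0);
  c[gid] = a[gid] * a[gid];
}
"""

for platform in cl.get_platforms():
    for device in platform.get_devices():
        ctx = cl.Context(devices=[device])
        queue = cl.CommandQueue(ctx)
        prg = cl.Program(ctx, kernel_src).build()  # compile before timing
        mf = cl.mem_flags
        res = np.empty_like(a)

        start = time.perf_counter()
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
        dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
        prg.sq(queue, a.shape, None, a_buf, dest_buf)
        cl.enqueue_copy(queue, res, dest_buf)
        queue.finish()
        print(device.name, time.perf_counter() - start, "seconds")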

Having said that, I think running your script in parallel will not be much of an improvement compared with using a better algorithm:

  • Use a sieve algorithm to get the primes up to the square root of your upper bound.
  • Apply a similar sieve algorithm, using the primes from the previous step, to sieve the numbers from your lower bound up to your upper bound.

Perhaps this is not a good example of a problem that can easily be run as a parallel algorithm, but at least you will be set up to try another example.

However, to show you how to solve this problem using your GPU, consider the following changes:

The serial algorithm would be something like the following:

from math import sqrt

def primes_below(number):
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag: yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag: yield n(i)

This prints the list of prime numbers between 1e6 and 5e6 in 0.64 seconds.

Trying your script on my GPU, it did not finish within 5 minutes. For a problem 10 times smaller (the primes between 1e5 and 5e5) it took about 29 seconds.

Modifying the script so that each work item on the GPU divides one odd number (there is no point in testing even numbers) by a series of pre-computed primes up to the square root of the upper bound, stopping once the prime exceeds the square root of the number itself, it completes the same task in 0.50 seconds. That is an improvement!

The code is as follows:

import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):

    primes = list(primes_below(int(sqrt(hi))+1))

    numbers_h = np.arange(  lo + (0 if lo&1 else 1), 
                            hi + (0 if hi&1 else 1),
                            2,
                            dtype=np.int32)

    size = (hi - lo - (0 if hi%2 else 1))//2 + 1

    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
   
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """

    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
       properties=[(cl.context_properties.PLATFORM, platforms[0])])     
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)

    # append a 0 sentinel so the kernel's `for (; *primes; ++primes)` loop
    # stops at the end of the list instead of reading past the buffer
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:] + [0],  # don't need 2
                                           dtype=np.int32))

    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)

    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]

    yield from array.get()[:length.get()]
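
To reproduce the timings quoted above, a rough harness along these lines will do (just a sketch: it assumes primes_between and primes_between_using_cl from this answer are defined in the same module, and the exact numbers will of course depend on your hardware):

import time

for f in (primes_between, primes_between_using_cl):
    start = time.perf_counter()
    count = sum(1 for _ in f(1_000_000, 5_000_000))
    elapsed = time.perf_counter() - start
    print(f"{f.__name__}: {count} primes in {elapsed:.2f} s")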

The following code is an example of a complete Python program, which usually includes:

  • the import statements
  • the function definitions
  • the main() function
  • the if __name__ == "__main__": section

I hope this helps you to solve your problem.

import pyprimes
from math import sqrt
import numpy as np

import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_below(number):
    """Generate a list of prime numbers below a specified  `number`"""
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit: break
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    """Generate a list of prime numbers betwenn `lo` and `hi` numbers"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue # avoid dividing by 2
        nlo = n(0)
        # slower # start = prime * (nlo//prime + 1) if nlo%prime else 0
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)

def primes_between_using_cl(lo, hi):
    """Generate a list of prime numbers betwenn a lo and hi numbers
    this is a parallel algorithm using pyopencl"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size_primes_h = np.array( (len(primes)-1, ), dtype=np.int32)
    numbers_h = np.arange(  lo + (0 if lo&1 else 1), 
                                  hi + (0 if hi&1 else 1),
                                  2,
                                  dtype=np.int32)
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                        __global         int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
         if (*primes > max) break;
         if (num % *primes == 0) {
            numbers[gid] = 0;
            return;
         }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
        properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    # trailing 0 sentinel terminates the kernel's `for (; *primes; ++primes)` loop
    primes_d = cl.array.to_device(queue, np.array(primes[1:] + [0], dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]

def test(f, lo, hi):
    """Test that all prime numbers are generated by comparing with the
    output of the library `pyprimes`"""
    a = filter(lambda p: p>lo, pyprimes.primes_below(hi))
    b = f(lo, hi)
    result = True
    for p, q in zip (a, b):
        if p != q:
            print(p, q)
            result = False
    return result
    
def main():
    lower = 1000
    upper = 5000
    print("The prime numbers between {} and {}, are:".format(lower,upper))
    print()
    for p in primes_between_using_cl(lower, upper):
        print(p, end=' ')
    print()

if __name__ == '__main__':
    main()