A simple question on CUDA threads in numba

This is very much a beginner question. I have been studying regular python threads and C threads, and I learned that I can create threads that run a specific function, and that they use semaphores and other synchronization primitives.

However, I am currently trying to learn CUDA with numba, the python-based compiler. I have written the code below.

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)

    if (pos_x, pos_y) <= data.shape:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_quality = (128, 72)
    image = np.zeros(image_quality)

    thread_size = 32
    block_size = image_quality

    image_saturate[block_size, thread_size](image)

    print(image)

However, I find it strange that I can change thread_size to whatever I want and the result is the same, meaning the output is always the expected result. But the moment I change block_size, weird things start happening and only that much of the original matrix gets filled with ones, so it is only a partial filling.

From this I understand that cuda.grid(2) returns the block coordinates. But can't I get the actual thread coordinates and the block coordinates as well?

I am a complete beginner and could not find any online resources to learn from. It would be great if someone could answer my question and also point me to resources for learning CUDA with Numba.

From this I understand that the cuda.grid(2) returns the block coordinates.

That is not the case. That statement returns a fully-qualified 2D thread index. The range of the returned values extends to the product of the block coordinate limits and the thread coordinate limits.
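
To make the composition of that index concrete, here is a minimal sketch (the kernel name and the flags array are mine, the intrinsics are numba's) showing that the tuple returned by cuda.grid(2) is the block index times the block size plus the thread index, per axis:

from numba import cuda

@cuda.jit
def check_grid_equivalence(flags):
    pos_x, pos_y = cuda.grid(2)
    # the same indices, built by hand from the per-block quantities
    gx = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    gy = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    if pos_x < flags.shape[0] and pos_y < flags.shape[1]:
        if gx == pos_x and gy == pos_y:
            # always taken: the two formulations agree for every thread
            flags[pos_x, pos_y] = 1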

In CUDA, the grid dimension parameter of a kernel launch (what you are calling block_size) specifies the grid dimensions in terms of the number of blocks (in each direction). The block dimension parameter of a kernel launch (what you are calling thread_size) specifies the size of each block in the grid, in terms of the number of threads per block.

Therefore, the total number of threads launched is the product of the grid dimensions and the block dimensions. The overall total is the product of all of these quantities across all dimensions, and the total in each dimension is the product of the grid dimension in that direction and the block dimension in that direction.
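
To make those numbers concrete for the launch in the question (a grid of (128, 72) blocks, 32 threads per block, so an implicit 1 in y), a few lines of plain host-side Python:

grid_dim = (128, 72)  # what the question passes as block_size: blocks per axis
block_dim = (32, 1)   # what the question passes as thread_size: a scalar 32 implies y == 1

threads_x = grid_dim[0] * block_dim[0]  # 128 * 32 = 4096 threads along x
threads_y = grid_dim[1] * block_dim[1]  # 72 * 1 = 72 threads along y
total = threads_x * threads_y           # 294912 threads in total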

So there is a problem with your design choices, because you have an image size and you are setting the grid dimensions equal to the image size. That only makes sense if there is exactly 1 thread per block. As you would discover by looking at examples of proper numba CUDA code, a typical approach is to divide the total desired dimension (in this case the image size or dimensions) by the number of threads per block, to get the grid dimension.
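
A minimal sketch of that sizing computation (variable names are mine; note that the rewritten code further down uses a "lazy" +1 round-up rather than the exact ceiling division shown here):

image_shape = (128, 72)
threads_per_block = (32, 32)

# ceiling division: just enough blocks per axis to cover the whole image
blocks_per_grid = (
    (image_shape[0] + threads_per_block[0] - 1) // threads_per_block[0],  # 4
    (image_shape[1] + threads_per_block[1] - 1) // threads_per_block[1],  # 3
)

# the launch would then look like: image_saturate[blocks_per_grid, threads_per_block](image)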

When we do that, the cuda.grid() statement in the kernel code will return a tuple with sensible ranges. In your case, it would return tuples to threads that correctly range over 0..127 in x and 0..71 in y. The problem you have right now is that the cuda.grid() statement can return tuples whose x component ranges over 0..((128*32)-1), which is unnecessary.

Of course, the goal of your if statement is to prevent out-of-bounds indexing, but the test with <= does not look correct to me. This is the classic off-by-one error in computer science. Threads for which the returned index is exactly equal to the shape limit should be excluded.

But, the moment I change the block_size weird things start happening and only that size of the original matrix gets filled with ones - so it's only a partial filling.

It really isn't clear what your expectation is here. Your kernel is designed so that each thread fills (at most) one output point. Sensible grid sizing therefore means matching the grid (the total threads in x and the total threads in y) to the image dimensions. If you follow the advice above for the grid sizing computation and then set your grid size smaller than your image size, I would expect that portions of your output image would not get filled. Don't do that. Or, if you must do that, use a grid-stride loop.
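
For reference, a grid-stride version of this kernel could look roughly like the sketch below (my own adaptation, using numba's cuda.gridsize() intrinsic); each thread steps across the image by the total number of threads in the grid, so any launch configuration covers the whole array:

from numba import cuda

@cuda.jit
def image_saturate_grid_stride(data):
    start_x, start_y = cuda.grid(2)        # this thread's starting coordinates
    stride_x, stride_y = cuda.gridsize(2)  # total threads launched in x and y
    for i in range(start_x, data.shape[0], stride_x):
        for j in range(start_y, data.shape[1], stride_y):
            data[i, j] = 1

With that pattern, even a deliberately undersized launch such as image_saturate_grid_stride[(5, 5), (8, 8)](image) still fills every element; each thread simply handles more than one of them.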

Having said all that, here is how I would rewrite your code:

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)

    if pos_x  < data.shape[0] and pos_y < data.shape[1]:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_x = 128
    image_y = 72
    image_quality = (image_x, image_y)
    image = np.zeros(image_quality)
    thread_x = 32
    thread_y = 1
    thread_size = (thread_x, thread_y)
    block_size = ((image_x//thread_x) + 1, (image_y//thread_y) + 1) # "lazy" round-up

    image_saturate[block_size, thread_size](image)

    print(image)

That seems to run correctly for me. If you now suggest that what you want to do is arbitrarily modify the block_size variable, for example:

block_size = (5,5)

and make no other changes, and expect the output image to be fully populated, I would say that is not a sensible expectation. I don't see how that could possibly be sensible, so I'll simply say that CUDA doesn't work that way. If you want to "decouple" the data size from the grid size, the canonical way to do that is the grid-stride loop already discussed.

I also removed the tuple comparison. I don't think it is really germane here. If you still want to use the tuple comparison, it should work exactly the way you would expect based on python. There isn't anything CUDA-specific about it.
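
One thing worth spelling out about that (plain Python, nothing CUDA-specific): tuple comparison is lexicographic, i.e. decided element by element at the first difference, which is not the same as checking that every coordinate is in range:

shape = (128, 72)
print((10, 50) <= shape)   # True: 10 < 128, so the second elements never get compared
print((10, 500) <= shape)  # also True, even though 500 is far outside the second axis
print((128, 0) <= shape)   # True: first elements tie, then 0 <= 72
print((129, 0) <= shape)   # False: 129 > 128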

I'm late to the party, but I thought an explanation with more of a visual component might help add some clarity to the existing answer. Illustrative answers for how thread indexing works in cuda are hard to come by, and although the concepts are easy to remember once you understand them, there can be many points of confusion on the way there, so hopefully this helps.

Please forgive the lack of discussion outside of the script's comments, but this seems like a case where the code context helps avoid misunderstandings and helps demonstrate the indexing concepts discussed by others. So I'll leave you with the script, the comments it contains, and the condensed output.

To produce the uncondensed output, see the first couple of comments in the script's main block.

The example script:

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    grid_pos_x, grid_pos_y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # element-wise bounds check so no thread indexes past the image edges
    if grid_pos_x < data.shape[0] and grid_pos_y < data.shape[1]:
        # note that the cuda device array retains the stride order of the original numpy array,
        # so the data is in row-major order (numpy default) and the first index (where we are
        # using grid_pos_x) doesn't actually map to the horizontal axis (aka the typical x-axis),
        # but rather maps to vertical axis (the typical y-axis).
        #
        # And, as you would then expect, the second axis (where we are using grid_pos_y) maps
        # to the horizontal axis of the array.
        #
        # What you should take away from this observation is that the x,y labels of cuda elements
        # have no explicit connection to the array's memory layout.
        #
        # Therefore, it is up to you, the programmer to understand the memory layout for your
        # array (whether it's the C-like row-major, or the Fortran-like column-major), and how
        # you should map the (x,y) thread IDs onto the coordinates of your array.
        data[grid_pos_x, grid_pos_y,0,0] = tx
        data[grid_pos_x, grid_pos_y,0,1] = ty
        data[grid_pos_x, grid_pos_y,0,2] = tx + ty
        data[grid_pos_x, grid_pos_y,1,0] = bx
        data[grid_pos_x, grid_pos_y,1,1] = by
        data[grid_pos_x, grid_pos_y,1,2] = bx + by
        data[grid_pos_x, grid_pos_y,2,0] = grid_pos_x
        data[grid_pos_x, grid_pos_y,2,1] = grid_pos_y
        data[grid_pos_x, grid_pos_y,2,2] = grid_pos_x + grid_pos_y

if __name__ == "__main__":
    # uncomment the following line and remove the line after it
    # if you run this code to get more readable results
    # np.set_printoptions(linewidth=500)
    np.set_printoptions(linewidth=500,threshold=3) # compressed output for use on stack overflow
    # image_quality = (128, 72)
    # we are shrinking image_quality to be 23x21 to make the printout easier to read,
    # and intentionally step away from the alignment of threads per block being a
    # multiplicative factor to the image shape. THIS IS BAD for practical applications,
    # it's just helpful for this illustration.
    image_quality = (23,21)
    image = np.zeros(image_quality+(3,3),int)-1

    ## thread_size = 32 # commented to show where the original variable was used
    # below, we rename the variable to be more semantically clear
    #       Note: to define the desired thread-count in multiple axis, threads_per_block
    #             would have to become a tuple, E.G.:
    #               # defines 32 threads in thread-block's x axis, and 16 in the y
    #               threads_per_block = 32,16
    threads_per_block = 32
    #threads_per_block = 32  results in an implicit 1 for the y-axis, and implicit 1 in the z.
    ### Thread blocks are always 3d
    # this is also true for the thread grid the device will create for your kernel

    ## block_size = image_quality
    # renaming block_size to semantically more accurate variable name
    #   Note: As with the threads_per_block, any axis we don't explicitly specify a size
    #         for will be given the default value of 1. So, because image_quality gives 2 values,
    #         for x/y respectively, the z axis will implicitly be given size of 1.
    block_count = image_quality
    # REMEMBER: The thread/block/grid dimensions we are passing to the function compiler
    #           are NOT used to infer details about the arguments being passed to the
    #           compiled function (our image array)
    #           It is up to us to write code that appropriately utilizes the arrangement
    #           of the thread blocks the device will build for us once inside the kernel.
    #           SEE THE COMMENT INSIDE THE image_saturate function above.
    image_saturate[block_count, threads_per_block](image)
    print(f"{block_count=}; {threads_per_block=}; {image.shape=}")
    print("thread id within block; x")
    print(image[:,:,0,0])
    print("\nthread id within block; y"
          "\n-- NOTE 1 regarding all zeros: see comment at the end of printout")
    print(image[:,:,0,1])
    print("\nsum of x,y thread id within block")
    print(image[:,:,0,2])
    print("\nblock id within grid; x"
          "\n-- NOTE 2 also regarding all zeros: see second comment at the eod of printout")
    print(image[:,:,1,0])
    print("\nblock id within grid; y")
    print(image[:,:,1,1])
    print("\nsum of x,y block id within grid")
    print(image[:,:,1,2])
    print("\nthread unique global x id within full grid; x")
    print(image[:,:,2,0])
    print("\nthread unique global y id within full grid; y")
    print(image[:,:,2,1])
    print("\nsum of thread's unique global x,y ids")
    print(image[:,:,2,2])
    print(f"{'End of 32 threads_per_block output':-<70}")
    threads_per_block = 16
    # reset the values of image so we can be sure to see if any elements 
    # of the image go unassigned
    image *= 0
    image -= 1
    # block_count = image_quality # if you wanted to try
    print(f"\n\n{block_count=}; {threads_per_block=}; {image.shape=}")
    image_saturate[block_count, threads_per_block](image)
    print("thread id within block; x")
    print(image[:,:,0,0])
    print("\nthread id within block; y "
          "\n-- again, see NOTE 1")
    print(image[:,:,0,1])
    print("\nsum of x,y thread id within block")
    print(image[:,:,0,2])
    print("\nblock id within grid; x "
          "\n-- notice that unlike when we had 32 thread_per_block, not all 0")
    print(image[:,:,1,0])
    print("\nblock id within grid; y")
    print(image[:,:,1,1])
    print("\nsum of x,y block id within grid")
    print(image[:,:,1,2])
    print("\nthread unique global x id within full grid; x")
    print(image[:,:,2,0])
    print("\nthread unique global y id within full grid; y")
    print(image[:,:,2,1])
    print("\nsum of thread's unique global x,y ids")
    print(image[:,:,2,2])
    from textwrap import dedent
    print(dedent("""
    NOTE 1:
    The thread IDs recorded for 'thread id within block; y'
    are all zero for both versions of `threads_per_block` because we never
    specify the number of threads per block that should be created for
    the 'y' axis.

    So, the compiler defaults to creating only a single thread along those
    undefined axes of each block. For that reason, we see that the only
    threadID.y value stored is 0 for all i,j elements of the array.

    NOTE 2:
    **Note 2 mostly pertains to the case where threads_per_block == 32,
      which is greater than the number of elements in the corresponding axis of the image**
    The block.x IDs recorded for 'block id within grid; x' are all zero for
    the `32` version of `threads_per_block` because of the relative difference in size
    between the specified number of threads per block and the number of elements in the
    image along the corresponding axis.

    The size of a block, in any axis, is determined by the specified number
    of threads for that axis. In this example script, we define threads_per_block
    to have an explicit 32 threads in the x axis, leaving the compiler to give an
    implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
    in the x-axis, and 21 blocks in the y; resulting in:
    \t* A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
    \t\t* (x:y:z -> 23:21:1)
    \t* Where each block has 32 threads
    \t\t* (x:y:z -> 32:1:1)
    \t* And our image has height:width of 23:21 for 483 'pixels' in each
    \t  contrived layer of the image.

    As it is hopefully being made clear now, you should see that because each
    block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
    axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
    will be used. Do note that the overhead of creating those unused blocks is a gross waste
    of GPU processor time and could potentially reduce the available resources to the block
    that does get used."""))

Output:

block_count=(23, 21); threads_per_block=32; image.shape=(23, 21, 3, 3)
thread id within block; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread id within block; y
-- NOTE 1 regarding all zeros: see comment at the end of printout
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

sum of x,y thread id within block
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

block id within grid; x
-- NOTE 2 also regarding all zeros: see second comment at the end of printout
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

block id within grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of x,y block id within grid
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

thread unique global x id within full grid; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread unique global y id within full grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of thread's unique global x,y ids
[[ 0  1  2 ... 18 19 20]
 [ 1  2  3 ... 19 20 21]
 [ 2  3  4 ... 20 21 22]
 ...
 [20 21 22 ... 38 39 40]
 [21 22 23 ... 39 40 41]
 [22 23 24 ... 40 41 42]]
End of 32 threads_per_block output------------------------------------


block_count=(23, 21); threads_per_block=16; image.shape=(23, 21, 3, 3)
thread id within block; x
[[0 0 0 ... 0 0 0]
 [1 1 1 ... 1 1 1]
 [2 2 2 ... 2 2 2]
 ...
 [4 4 4 ... 4 4 4]
 [5 5 5 ... 5 5 5]
 [6 6 6 ... 6 6 6]]

thread id within block; y 
-- again, see NOTE 1
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

sum of x,y thread id within block
[[0 0 0 ... 0 0 0]
 [1 1 1 ... 1 1 1]
 [2 2 2 ... 2 2 2]
 ...
 [4 4 4 ... 4 4 4]
 [5 5 5 ... 5 5 5]
 [6 6 6 ... 6 6 6]]

block id within grid; x 
-- notice that unlike when we had 32 threads_per_block, not all 0
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]]

block id within grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of x,y block id within grid
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 1  2  3 ... 19 20 21]
 [ 1  2  3 ... 19 20 21]
 [ 1  2  3 ... 19 20 21]]

thread unique global x id within full grid; x
[[ 0  0  0 ...  0  0  0]
 [ 1  1  1 ...  1  1  1]
 [ 2  2  2 ...  2  2  2]
 ...
 [20 20 20 ... 20 20 20]
 [21 21 21 ... 21 21 21]
 [22 22 22 ... 22 22 22]]

thread unique global y id within full grid; y
[[ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 ...
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]
 [ 0  1  2 ... 18 19 20]]

sum of thread's unique global x,y ids
[[ 0  1  2 ... 18 19 20]
 [ 1  2  3 ... 19 20 21]
 [ 2  3  4 ... 20 21 22]
 ...
 [20 21 22 ... 38 39 40]
 [21 22 23 ... 39 40 41]
 [22 23 24 ... 40 41 42]]

NOTE 1:
The thread IDs recorded for 'thread id within block; y'
are all zero for both versions of `threads_per_block` because we never
specify the number of threads per block that should be created for
the 'y' axis.

So, the compiler defaults to creating only a single thread along those
undefined axes of each block. For that reason, we see that the only
threadID.y value stored is 0 for all i,j elements of the array.

NOTE 2:
**Note 2 mostly pertains to the case where threads_per_block == 32,
  which is greater than the number of elements in the corresponding axis of the image**
The block.x IDs recorded for 'block id within grid; x' are all zero for
the `32` version of `threads_per_block` because of the relative difference in size
between the specified number of threads per block and the number of elements in the
image along the corresponding axis.

The size of a block, in any axis, is determined by the specified number
of threads for that axis. In this example script, we define threads_per_block
to have an explicit 32 threads in the x axis, leaving the compiler to give an
implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
in the x-axis, and 21 blocks in the y; resulting in:
    * A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
        * (x:y:z -> 23:21:1)
    * Where each block has 32 threads
        * (x:y:z -> 32:1:1)
    * And our image has height:width of 23:21 for 483 'pixels' in each
      contrived layer of the image.

As it is hopefully being made clear now, you should see that because each
block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
will be used. Do note that the overhead of creating those unused blocks is a gross waste
of GPU processor time and could potentially reduce the available resources to the block
that does get used.