Why is the unique work ID calculated incorrectly after the magic number 16777216?

I don't understand where I went wrong. The goal is to fill an array with unique work IDs.

I found a magic number: q = 16777216 = 1024 * 1024 * 16. If the array length is no greater than q, everything is fine. The check_array function returns True with these parameters:

arr_size = 1024 * 1024 * 16
grid_size = (1024, 1024, 16)
block_size = (32, 8, 4)

check_array(...) -> True

But...

arr_size = 1024 * 1024 * 20
grid_size = (1024, 1024, 20)
block_size = (32, 8, 4)

check_array(...) -> False

I took the slice arr[16777216:16777226].

In theory the result should be arr[16777216:16777226] = [16777216, 16777217, 16777218, 16777219, 16777220, 16777221, 16777222, 16777223, 16777224, 16777225]

But in practice arr[16777216:16777226] = [16777216, **16777216**, 16777218, **16777220**, 16777220, **16777220**, 16777222, **16777224**, 16777224, **16777224**]

Please help me.

The GPU is an Nvidia GeForce GTX 1050 Ti.

OpenCL kernel:

int get_general_block_id(int3 block_ids)
{
    int3 grid_size = {get_num_groups(0), get_num_groups(1), get_num_groups(2)};
    return block_ids.s0 + block_ids.s1 * grid_size.s0 + block_ids.s2 * grid_size.s0 * grid_size.s1;
}


int get_local_thread_id(int3 thread_ids)
{
    int3 block_size = {get_local_size(0), get_local_size(1), get_local_size(2)};
    return thread_ids.s0 + thread_ids.s1 * block_size.s0 + thread_ids.s2 * block_size.s0 * block_size.s1;
}


int get_global_thread_id(int gen_block_id, int gen_thread_id)
{
    int3 block_size = {get_local_size(0), get_local_size(1), get_local_size(2)};
    return gen_block_id * block_size.s0 * block_size.s1 * block_size.s2 + gen_thread_id; 
}


kernel void index_function(global float *array, int length)
{
    int3 b_ids = {get_group_id(0), get_group_id(1), get_group_id(2)};
    int gb_id = get_general_block_id(b_ids);
    
    int3 t_ids = {get_local_id(0), get_local_id(1), get_local_id(2)};
    int lt_id = get_local_thread_id(t_ids);
    
    int global_tid = get_global_thread_id(gb_id, lt_id);
    
    if (global_tid >= length)
    {
        return;
    }
    array[global_tid] = global_tid;
}

Python code:

import numpy as np
import pyopencl as cl


def load_core(filename='core.c') -> str:
    core_text = ''
    with open(filename, 'r') as f:
        for line in f:
            core_text += line
    return core_text


def create_context_and_queue():
    # Run only one device!
    platforms = cl.get_platforms()
    gpu_dev = []
    for platform_item in platforms:
        devs = platform_item.get_devices(device_type=cl.device_type.GPU)
        gpu_dev += devs

    context = cl.Context(devices=[gpu_dev[0]])
    queue = cl.CommandQueue(context)
    return context, queue


def compile_function(context, filename='core.c'):
    core = load_core(filename)
    module = cl.Program(context, core).build()
    return module.index_function


def check_array(arr: np.ndarray):
    for i, item in enumerate(arr):
        delta = abs(i - item)
        if delta > 1e-5:
            return False
    else:
        return True


arr_size = 1000 * 1000 * 1000
grid_size = (1000, 1000, 60)
block_size = (10, 10, 10)
if __name__ == '__main__':
    context, queue = create_context_and_queue()
    cl_function = compile_function(context)

    arr = np.zeros(shape=arr_size, dtype=np.float32)
    arr.fill(-1)

    mf = cl.mem_flags
    a_g = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=arr)

    cl_function(queue, grid_size, block_size, a_g, np.uint32(arr_size))

    result_arr = cl.Buffer(context, mf.WRITE_ONLY, arr.nbytes)
    cl.enqueue_copy(queue, arr, a_g)

    print(check_array(arr))

Update

I added the following code to the index_function kernel:

if ((global_tid >= 16777210) && (global_tid <= 16777245))
    {
        printf("block_id=%i thread_id=%i gid=%i val=%f\n", gb_id, lt_id, global_tid, array[global_tid]);
    }

Result:

block_id=16383 thread_id=1018 gid=16777210 val=16777210.000000
block_id=16383 thread_id=1019 gid=16777211 val=16777211.000000
block_id=16383 thread_id=1020 gid=16777212 val=16777212.000000
block_id=16383 thread_id=1021 gid=16777213 val=16777213.000000
block_id=16383 thread_id=1022 gid=16777214 val=16777214.000000
block_id=16383 thread_id=1023 gid=16777215 val=16777215.000000
block_id=16384 thread_id=0 gid=16777216 val=16777216.000000
block_id=16384 thread_id=1 gid=16777217 val=16777216.000000
block_id=16384 thread_id=2 gid=16777218 val=16777218.000000
block_id=16384 thread_id=3 gid=16777219 val=16777220.000000
block_id=16384 thread_id=4 gid=16777220 val=16777220.000000
block_id=16384 thread_id=5 gid=16777221 val=16777220.000000
block_id=16384 thread_id=6 gid=16777222 val=16777222.000000
block_id=16384 thread_id=7 gid=16777223 val=16777224.000000
block_id=16384 thread_id=8 gid=16777224 val=16777224.000000
block_id=16384 thread_id=9 gid=16777225 val=16777224.000000
block_id=16384 thread_id=10 gid=16777226 val=16777226.000000
block_id=16384 thread_id=11 gid=16777227 val=16777228.000000
block_id=16384 thread_id=12 gid=16777228 val=16777228.000000
block_id=16384 thread_id=13 gid=16777229 val=16777228.000000
block_id=16384 thread_id=14 gid=16777230 val=16777230.000000
block_id=16384 thread_id=15 gid=16777231 val=16777232.000000
block_id=16384 thread_id=16 gid=16777232 val=16777232.000000
block_id=16384 thread_id=17 gid=16777233 val=16777232.000000
block_id=16384 thread_id=18 gid=16777234 val=16777234.000000
block_id=16384 thread_id=19 gid=16777235 val=16777236.000000
block_id=16384 thread_id=20 gid=16777236 val=16777236.000000
block_id=16384 thread_id=21 gid=16777237 val=16777236.000000
block_id=16384 thread_id=22 gid=16777238 val=16777238.000000
block_id=16384 thread_id=23 gid=16777239 val=16777240.000000
block_id=16384 thread_id=24 gid=16777240 val=16777240.000000
block_id=16384 thread_id=25 gid=16777241 val=16777240.000000
block_id=16384 thread_id=26 gid=16777242 val=16777242.000000
block_id=16384 thread_id=27 gid=16777243 val=16777244.000000
block_id=16384 thread_id=28 gid=16777244 val=16777244.000000
block_id=16384 thread_id=29 gid=16777245 val=16777244.000000

Why are the array elements not equal to the values assigned to them?

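16777216 + 1 = 2^24 + 1 = 16777217 is the smallest positive integer that cannot be represented exactly as an IEEE 754 single-precision (32-bit) floating-point value. The kernel writes each int index into a float buffer, so from 2^24 upward only every second integer can be stored exactly, and the others are rounded to the nearest representable value. For details, see this question and its answers: Why does a float variable stop incrementing at 16777216 in C#?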
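
The rounding can be reproduced on the host without a GPU. The sketch below uses only Python's standard library and assumes the usual round-to-nearest-even conversion to single precision. The helper name `to_float32` is made up for illustration and is not part of the question's code. It rounds the ten indices from the question's slice and yields the same values the kernel wrote.

```python
import struct


def to_float32(value: int) -> float:
    """Round an integer to the nearest IEEE 754 single-precision value."""
    # Packing as a 4-byte float and unpacking again discards the bits
    # that a 32-bit float cannot hold.
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


start = 2 ** 24  # 16777216, the "magic number" from the question
stored = [int(to_float32(i)) for i in range(start, start + 10)]
print(stored)
# [16777216, 16777216, 16777218, 16777220, 16777220,
#  16777220, 16777222, 16777224, 16777224, 16777224]

# Below 2**24 every integer still survives the round trip.
print(int(to_float32(start - 1)) == start - 1)  # True
```

Odd indices fall exactly halfway between two representable floats, and ties go to the neighbour with an even mantissa, which explains the alternating pattern in the printf output.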

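Change your code to use only integers (for example, an int buffer in the kernel and an integer dtype for the host array) and it will work.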
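
As a rough check of why the integer version is safe, the following standard-library sketch round-trips the same indices through a 32-bit signed integer. It assumes the kernel's `int` is 32 bits wide, as in OpenCL C. The helper name `to_int32` is hypothetical. The largest index for the question's arr_size = 1000 * 1000 * 1000 also fits.

```python
import struct


def to_int32(value: int) -> int:
    """Round-trip a value through a 32-bit signed integer (OpenCL `int`)."""
    # struct.pack raises struct.error if the value does not fit in 32 bits.
    return struct.unpack("<i", struct.pack("<i", value))[0]


INT32_MAX = 2 ** 31 - 1
arr_size = 1000 * 1000 * 1000  # array length used in the question's script
start = 2 ** 24

# Every index in the problematic slice is stored exactly.
stored = [to_int32(i) for i in range(start, start + 10)]
print(stored == list(range(start, start + 10)))  # True

# The largest index of the full array also fits in a 32-bit int.
print(arr_size - 1 <= INT32_MAX)                 # True
print(to_int32(arr_size - 1) == arr_size - 1)    # True
```

On the host side this would correspond to creating the array with an integer dtype such as np.int32 instead of np.float32, and on the device side to declaring the kernel argument as global int *array.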