Why is the unique work-item ID calculated incorrectly past the magic number 16777216?
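I can't figure out where I went wrong.
The goal is to fill an array with unique work-item IDs.
I found a magic number: q = 16777216 = 1024 * 1024 * 16.
If the array length does not exceed q, everything is fine: check_array returns True with these parameters: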
arr_size = 1024 * 1024 * 16
grid_size = (1024, 1024, 16)
block_size = (32, 8, 4)
check_array(...) -> True
But...
arr_size = 1024 * 1024 * 20
grid_size = (1024, 1024, 20)
block_size = (32, 8, 4)
check_array(...) -> False
I took the slice arr[16777216:16777226].
In theory the result should be
arr[16777216:16777226] = [16777216, 16777217, 16777218, 16777219, 16777220, 16777221, 16777222, 16777223, 16777224, 16777225]
but in fact it is
arr[16777216:16777226] = [16777216, **16777216**, 16777218, **16777220**, 16777220, **16777220**, 16777222, **16777224**, 16777224, **16777224**]
Please help.
The GPU is an Nvidia GeForce GTX 1050 Ti.
OpenCL kernel:
int get_general_block_id(int3 block_ids)
{
    int3 grid_size = {get_num_groups(0), get_num_groups(1), get_num_groups(2)};
    return block_ids.s0 + block_ids.s1 * grid_size.s0 + block_ids.s2 * grid_size.s0 * grid_size.s1;
}

int get_local_thread_id(int3 thread_ids)
{
    int3 block_size = {get_local_size(0), get_local_size(1), get_local_size(2)};
    return thread_ids.s0 + thread_ids.s1 * block_size.s0 + thread_ids.s2 * block_size.s0 * block_size.s1;
}

int get_global_thread_id(int gen_block_id, int gen_thread_id)
{
    int3 block_size = {get_local_size(0), get_local_size(1), get_local_size(2)};
    return gen_block_id * block_size.s0 * block_size.s1 * block_size.s2 + gen_thread_id;
}

kernel void index_function(global float *array, int length)
{
    int3 b_ids = {get_group_id(0), get_group_id(1), get_group_id(2)};
    int gb_id = get_general_block_id(b_ids);
    int3 t_ids = {get_local_id(0), get_local_id(1), get_local_id(2)};
    int lt_id = get_local_thread_id(t_ids);
    int global_tid = get_global_thread_id(gb_id, lt_id);
    if (global_tid >= length)
    {
        return;
    }
    array[global_tid] = global_tid;
}
Python code:
import numpy as np
import pyopencl as cl


def load_core(filename='core.c') -> str:
    core_text = ''
    with open(filename, 'r') as f:
        for line in f:
            core_text += line
    return core_text


def create_context_and_queue():
    # Run only one device!
    platforms = cl.get_platforms()
    gpu_dev = []
    for platform_item in platforms:
        devs = platform_item.get_devices(device_type=cl.device_type.GPU)
        gpu_dev += devs
    context = cl.Context(devices=[gpu_dev[0]])
    queue = cl.CommandQueue(context)
    return context, queue


def compile_function(context, filename='core.c'):
    core = load_core(filename)
    module = cl.Program(context, core).build()
    return module.index_function


def check_array(arr: np.ndarray):
    for i, item in enumerate(arr):
        delta = abs(i - item)
        if delta > 1e-5:
            return False
    else:
        return True


arr_size = 1000 * 1000 * 1000
grid_size = (1000, 1000, 60)
block_size = (10, 10, 10)

if __name__ == '__main__':
    context, queue = create_context_and_queue()
    cl_function = compile_function(context)
    arr = np.zeros(shape=arr_size, dtype=np.float32)
    arr.fill(-1)
    mf = cl.mem_flags
    a_g = cl.Buffer(context, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=arr)
    cl_function(queue, grid_size, block_size, a_g, np.uint32(arr_size))
    result_arr = cl.Buffer(context, mf.WRITE_ONLY, arr.nbytes)
    cl.enqueue_copy(queue, arr, a_g)
    print(check_array(arr))
Update
I added this snippet to the kernel function index_function:
if ((global_tid >= 16777210) && (global_tid <= 16777245))
{
    printf("block_id=%i thread_id=%i gid=%i val=%f\n", gb_id, lt_id, global_tid, array[global_tid]);
}
Result:
block_id=16383 thread_id=1018 gid=16777210 val=16777210.000000
block_id=16383 thread_id=1019 gid=16777211 val=16777211.000000
block_id=16383 thread_id=1020 gid=16777212 val=16777212.000000
block_id=16383 thread_id=1021 gid=16777213 val=16777213.000000
block_id=16383 thread_id=1022 gid=16777214 val=16777214.000000
block_id=16383 thread_id=1023 gid=16777215 val=16777215.000000
block_id=16384 thread_id=0 gid=16777216 val=16777216.000000
block_id=16384 thread_id=1 gid=16777217 val=16777216.000000
block_id=16384 thread_id=2 gid=16777218 val=16777218.000000
block_id=16384 thread_id=3 gid=16777219 val=16777220.000000
block_id=16384 thread_id=4 gid=16777220 val=16777220.000000
block_id=16384 thread_id=5 gid=16777221 val=16777220.000000
block_id=16384 thread_id=6 gid=16777222 val=16777222.000000
block_id=16384 thread_id=7 gid=16777223 val=16777224.000000
block_id=16384 thread_id=8 gid=16777224 val=16777224.000000
block_id=16384 thread_id=9 gid=16777225 val=16777224.000000
block_id=16384 thread_id=10 gid=16777226 val=16777226.000000
block_id=16384 thread_id=11 gid=16777227 val=16777228.000000
block_id=16384 thread_id=12 gid=16777228 val=16777228.000000
block_id=16384 thread_id=13 gid=16777229 val=16777228.000000
block_id=16384 thread_id=14 gid=16777230 val=16777230.000000
block_id=16384 thread_id=15 gid=16777231 val=16777232.000000
block_id=16384 thread_id=16 gid=16777232 val=16777232.000000
block_id=16384 thread_id=17 gid=16777233 val=16777232.000000
block_id=16384 thread_id=18 gid=16777234 val=16777234.000000
block_id=16384 thread_id=19 gid=16777235 val=16777236.000000
block_id=16384 thread_id=20 gid=16777236 val=16777236.000000
block_id=16384 thread_id=21 gid=16777237 val=16777236.000000
block_id=16384 thread_id=22 gid=16777238 val=16777238.000000
block_id=16384 thread_id=23 gid=16777239 val=16777240.000000
block_id=16384 thread_id=24 gid=16777240 val=16777240.000000
block_id=16384 thread_id=25 gid=16777241 val=16777240.000000
block_id=16384 thread_id=26 gid=16777242 val=16777242.000000
block_id=16384 thread_id=27 gid=16777243 val=16777244.000000
block_id=16384 thread_id=28 gid=16777244 val=16777244.000000
block_id=16384 thread_id=29 gid=16777245 val=16777244.000000
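Why doesn't the array element equal the value assigned to it?
16777216 + 1 = 2^24 + 1 = 16777217 is the smallest positive integer that cannot be represented exactly as an IEEE 754 single-precision (32-bit) floating-point value. For details, see this question and its answers: Why does a float variable stop incrementing at 16777216 in C#?
Change your code to use integers only (an integer buffer in the kernel and a matching integer dtype on the host) and it will work.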
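To see the rounding without a GPU, here is a minimal sketch in plain Python. It assumes that round-tripping a value through the standard `struct` module's `"f"` format behaves like the kernel storing an `int` into a `global float *` buffer; the helper names `to_float32` and `to_int32` are made up for this illustration and are not part of the original code.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip a value through a 32-bit IEEE 754 float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def to_int32(value: int) -> int:
    """Round-trip a value through a 32-bit signed integer (hypothetical helper)."""
    return struct.unpack("i", struct.pack("i", value))[0]


# The same ten IDs as the slice arr[16777216:16777226] in the question.
ids = list(range(16777216, 16777226))

as_float = [int(to_float32(i)) for i in ids]
as_int = [to_int32(i) for i in ids]

print(as_float)
# [16777216, 16777216, 16777218, 16777220, 16777220,
#  16777220, 16777222, 16777224, 16777224, 16777224]
print(as_int == ids)  # True: integers keep every ID distinct
```

The float list reproduces the slice reported in the question: above 2^24 only even integers are representable, so each odd ID is rounded to a neighbouring even one. The 32-bit integer round trip leaves all ten IDs unchanged, which is why an integer buffer avoids the problem.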