Dask where returns NaN on valid array
I am trying to use dask to speed up my numpy code. Below is part of my numpy code:
import numpy as np

arr_1 = np.load('<arr1_path>.npy')
arr_2 = np.load('<arr2_path>.npy')
arr_3 = np.load('<arr3_path>.npy')
arr_1 = np.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = np.where(arr_4 == True)
[rn, cn] = np.where(arr_4 == False)
print(len(r))
This prints valid results and works fine. However, the dask equivalent below
import dask.array as da
import numpy as np

arr_1 = da.from_zarr('<arr1_path>.zarr')
arr_2 = da.from_zarr('<arr2_path>.zarr')
arr_3 = da.from_zarr('<arr3_path>.zarr')
arr_1 = da.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = da.where(arr_4 == True)
[rn, cn] = da.where(arr_4 == False)
print(len(r))  # <----- Error: 'float' object cannot be interpreted as an integer
fails, and the resulting r is
dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
hence the error above. Since dask arrays are evaluated lazily, do I need to explicitly call compute() (or something similar) somewhere? Or am I missing something fundamental? Any help would be appreciated.
The array you construct with da.where has unknown chunk sizes, which can happen whenever the size of an array depends on lazy computations that haven't yet been performed. Unknown values within shape or chunks are designated using np.nan rather than an integer, which is why you see the ValueError (this error message was improved in the last few months). The solution is to call compute_chunk_sizes():
import dask.array as da
import numpy as np

x = da.from_array(np.random.randn(100), chunks=20)
y = x[x > 0]              # y's shape and chunk sizes are unknown (nan) until computed
# len(y)                  # ValueError: Cannot call len() on object with unknown chunk size.
y.compute_chunk_sizes()   # modifies y in-place
len(y)
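Applied to the arrays in your code, the same call resolves the unknown shapes before len() is used. A minimal sketch, assuming r and c come from the da.where calls above:

# r and c have shape (nan,) because their length depends on the data in arr_4
r.compute_chunk_sizes()  # runs the underlying computation and fills in real chunk sizes
c.compute_chunk_sizes()  # same for the column indices
print(len(r))            # works now that the shape is known

Note that compute_chunk_sizes() does trigger a computation, so there is a cost to it; the arrays themselves remain lazy dask arrays afterwards.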