Dask where returns NaN on valid array
I am trying to use dask to speed up my numpy code. Below is part of my numpy code:
import numpy as np

arr_1 = np.load('<arr1_path>.npy')
arr_2 = np.load('<arr2_path>.npy')
arr_3 = np.load('<arr3_path>.npy')
arr_1 = np.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = np.where(arr_4 == True)
[rn, cn] = np.where(arr_4 == False)
print(len(r))
This prints valid results and works fine. However, the dask equivalent below
import dask.array as da
import numpy as np

arr_1 = da.from_zarr('<arr1_path>.zarr')
arr_2 = da.from_zarr('<arr2_path>.zarr')
arr_3 = da.from_zarr('<arr3_path>.zarr')
arr_1 = da.concatenate((arr_1, arr_2[:, :, np.newaxis]), axis=2)
arr_1_half = arr_1.shape[0] // 2
arr_4 = arr_3[:arr_1_half]
[r, c] = da.where(arr_4 == True)
[rn, cn] = da.where(arr_4 == False)
print(len(r))  # <----- Error: 'float' object cannot be interpreted as an integer
fails, and the resulting r is
dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
hence the error above. Since dask arrays are evaluated lazily, do I need to explicitly call compute() (or something similar) somewhere? Or am I missing something fundamental? Any help would be appreciated.
The array you construct with da.where has unknown chunk sizes, which can happen whenever the size of an array depends on lazy computations that haven't yet been performed. Unknown values within shape or chunks are designated using np.nan rather than an integer, which is why you see the ValueError (this error message was improved in the last few months). The solution is to call compute_chunk_sizes():
import dask.array as da
import numpy as np

x = da.from_array(np.random.randn(100), chunks=20)
y = x[x > 0]              # y's shape and chunk sizes are unknown (nan) until computed
# len(y)                  # ValueError: Cannot call len() on object with unknown chunk size.
y.compute_chunk_sizes()   # modifies y in-place
len(y)
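Applied to the arrays in your code, the same call resolves the unknown shapes before len() is used. A minimal sketch, assuming r and c come from the da.where calls above:

# r and c have shape (nan,) because their length depends on the data in arr_4
r.compute_chunk_sizes()  # runs the underlying computation and fills in real chunk sizes
c.compute_chunk_sizes()  # same for the column indices
print(len(r))            # works now that the shape is known

Note that compute_chunk_sizes() does trigger a computation, so there is a cost to it; the arrays themselves remain lazy dask arrays afterwards.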