Iterating through dask array chunks

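I'm trying to manually iterate through the chunks of a dask array, one at a time, and apply my computation to each. I know one benefit of dask is that it can do this iteration for me, but my computation is failing (for reasons I believe are unrelated to dask) and I want to iterate manually in order to debug it. How can I do that?

I'm imagining something like this: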

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to, 
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)

Here the chunk I get back would carry some information about which block I am on, and would also let me fetch that block's data.

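Try using data.chunks instead of data.iterchunks(). Note that data.chunks holds only the chunk sizes along each axis, not the chunk data itself.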
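Since data.chunks is a tuple of per-axis chunk sizes, one way to make use of it is to turn those sizes into index slices, one set per block. The following is only a sketch: block_slices is a hypothetical helper (not part of dask), and the chunks tuple is written out by hand to match what data.chunks holds for the array in the question.

```python
import itertools


def block_slices(chunks):
    """Yield (block_index, slices) for each block of a dask-style chunks tuple.

    Hypothetical helper for illustration; it is not a dask function.
    """
    per_axis = []
    for sizes in chunks:
        # Running totals of the chunk sizes give the stop offset of each chunk.
        stops = list(itertools.accumulate(sizes))
        starts = [stop - size for stop, size in zip(stops, sizes)]
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])

    index_ranges = [range(len(axis)) for axis in per_axis]
    for block_index in itertools.product(*index_ranges):
        slices = tuple(axis[i] for axis, i in zip(per_axis, block_index))
        yield block_index, slices


# What data.chunks holds for shape (1000, 100, 100) with chunks=(-1, 10, 10).
chunks = ((1000,), (10,) * 10, (10,) * 10)
blocks = list(block_slices(chunks))
```

With a real dask array, data[slices].compute() would then load the block at block_index as a NumPy array.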

Use the arr.blocks attribute to access the data in each chunk. The BlockView object has an array-like interface, but indexing into it returns the selected blocks of the original array:

In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>

In [13]: data.blocks.shape
Out[13]: (1, 10, 10)

In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>

In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14,  5, 24, ..., 25, 20,  6],
        [17, 12,  2, ..., 27, 13, 18],
        [13, 25,  2, ...,  7,  5, 22],
        ...,
        [12, 22, 26, ..., 15,  4, 11],
        [ 0, 26, 28, ..., 22, 14,  4],
        [ 9, 21, 14, ..., 15, 18, 21]],

       ...,

       [[ 3,  2, 20, ..., 27,  0, 12],
        [21, 17,  7, ..., 23,  3, 23],
        [24, 13,  0, ..., 26,  1,  0],
        ...,
        [ 5, 25,  6, ..., 22,  6, 16],
        [16, 25, 21, ..., 22, 14, 15],
        [ 8, 20, 17, ..., 29, 13,  1]]])

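So in your case, you can iterate over all the blocks with the following:

In [33]: import itertools

In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
    ...:     chunk = data.blocks[inds]  # still a lazy dask array; call .compute() for NumPy data
    ...:     my_function(chunk)

This will be slow, but I think it is what you are looking for.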

You can use da.map_blocks and avoid the for loop:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
mapped_data = da.map_blocks(my_function, data)
# This is equivalent
mapped_data = data.map_blocks(my_function)
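If the goal is debugging, one option is to keep map_blocks but run the computation on dask's single-threaded scheduler, so that an exception or a debugger breakpoint inside the function happens in the calling thread. The sketch below assumes a hypothetical my_function and a smaller array than the one in the question; dtype is passed explicitly so that dask does not have to infer it.

```python
import dask.array as da


def my_function(block):
    # Hypothetical stand-in; receives each block as a NumPy array.
    return block * 2


# Smaller than the array in the question so the example runs quickly.
data = da.random.randint(0, 30, size=(100, 40, 40), chunks=(-1, 10, 10))

mapped_data = data.map_blocks(my_function, dtype=data.dtype)

# Run every block in the current thread instead of a thread pool.
result = mapped_data.compute(scheduler="synchronous")
```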