Iterating through dask array chunks
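I am trying to iterate manually through the chunks of a dask array, one at a time, and apply my computation to each. I know that one benefit of dask is that it can do the iterating for me, but my computation is failing (for reasons I believe are unrelated to dask) and I would like to iterate manually in order to debug. How can I do that?
I am imagining something like this: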
import dask.array as da

data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

# Pseudocode: iterchunks() and get_chunk() are imagined here, not real dask APIs
for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to,
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)
Here the chunk I get back would carry some information about which chunk I am in, and would also give me that chunk's data.
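Try using data.chunks instead of data.iterchunks(). Note that data.chunks holds only the block sizes along each axis, not the block data, so you still have to slice the array yourself.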
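As a rough sketch of how block sizes could be turned into something you can index with: the helper below, `chunk_slices`, is a hypothetical name of my own and not part of dask. It takes a tuple laid out like `data.chunks` (one tuple of block lengths per axis) and yields each block's index together with the slices that select it. The `chunks` literal is what `data.chunks` holds for the array in the question.

```python
import itertools


def chunk_slices(chunks):
    """Yield (block_index, slices) for every block described by `chunks`.

    `chunks` is laid out like dask's `Array.chunks`: one tuple of block
    lengths per axis. Hypothetical helper, not a dask function.
    """
    per_axis = []
    for sizes in chunks:
        # Running totals give the stop offset of each block along this axis
        stops = list(itertools.accumulate(sizes))
        starts = [stop - size for stop, size in zip(stops, sizes)]
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])

    index_ranges = [range(len(axis)) for axis in per_axis]
    for block_index in itertools.product(*index_ranges):
        slices = tuple(per_axis[ax][i] for ax, i in enumerate(block_index))
        yield block_index, slices


# The value of data.chunks for shape (1000, 100, 100) with chunks=(-1, 10, 10)
chunks = ((1000,), (10,) * 10, (10,) * 10)
blocks = list(chunk_slices(chunks))
```

With the dask array from the question, `data[slices].compute()` should then give the NumPy data for the block at `block_index`.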
Use the arr.blocks attribute to access the data in each block. The BlockView object has an array-like interface, but indexing into it returns the selected block(s) of the original array:
In [11]: data
Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
In [12]: data.blocks
Out[12]: <dask.array.core.BlockView at 0x1730b2da0>
In [13]: data.blocks.shape
Out[13]: (1, 10, 10)
In [14]: data.blocks[0, 0, 0]
Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
In [15]: data.blocks[0, 0, 0].compute()
Out[15]:
array([[[14,  5, 24, ..., 25, 20,  6],
        [17, 12,  2, ..., 27, 13, 18],
        [13, 25,  2, ...,  7,  5, 22],
        ...,
        [12, 22, 26, ..., 15,  4, 11],
        [ 0, 26, 28, ..., 22, 14,  4],
        [ 9, 21, 14, ..., 15, 18, 21]],

       ...,

       [[ 3,  2, 20, ..., 27,  0, 12],
        [21, 17,  7, ..., 23,  3, 23],
        [24, 13,  0, ..., 26,  1,  0],
        ...,
        [ 5, 25,  6, ..., 22,  6, 16],
        [16, 25, 21, ..., 22, 14, 15],
        [ 8, 20, 17, ..., 29, 13,  1]]])
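So, in your case, you can iterate over all the blocks with the following statement:
In [33]: import itertools
In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
    ...:     chunk = data.blocks[inds]
    ...:     my_function(chunk)
This will be slow, but I think it is what you are looking for. Note that chunk is still a lazy dask array here; call chunk.compute() first if my_function expects a NumPy array.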
You can use da.map_blocks and avoid the for loop:
import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))
mapped_data = da.map_blocks(my_function, data)
# This is equivalent
mapped_data = data.map_blocks(my_function)
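If you also want to know which block the function is working on, one option is the `block_info` keyword that `map_blocks` passes to functions that accept it. The sketch below is a guess at how that could be combined with the single-threaded scheduler for debugging. `my_function`, the `seen` list, and the smaller array size are illustrative assumptions, and the guard is there because dask may call the function once without real block information while it works out the output metadata.

```python
import dask.array as da

seen = []  # chunk locations visited, recorded for inspection


def my_function(block, block_info=None):
    # block_info may be absent when dask probes the function for metadata
    if isinstance(block_info, dict) and 0 in block_info:
        # Key 0 describes the first input array; record where this block sits
        seen.append(tuple(block_info[0]["chunk-location"]))
    return block * 2


# Smaller than the array in the question, same chunking pattern
data = da.random.randint(0, 30, size=(100, 40, 40), chunks=(-1, 10, 10))
mapped_data = data.map_blocks(my_function, dtype=data.dtype)

# The synchronous scheduler runs everything in the calling thread, so
# exceptions and breakpoints inside my_function behave as in plain Python
result = mapped_data.compute(scheduler="synchronous")
```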