dask 在运行 map_partitions 之前如何知道变量状态？

Question

在下面的 dask 代码中，我在执行两个 map_partitions 之前将 x 设置为 1 和 2。结果好像还不错，就是没看懂。

如果 dask 只在找到 compute() 时才等待运行这两个 map_partitions，而此时它发现 compute() x 是 2，那该多好啊知道第一个 map_partitions?

中的 x = 1

pdf = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 4, 1, 2, 2, 1],
    'balance': [150, 140, 130, 280, 260, 220, 230, 330, 420, 120, 210]
})    
ddf = dd.from_pandas(pdf, npartitions=2) 
    
def func(df, a):
    return a

x = 1
ddf['should_be_1'] = ddf.map_partitions(func, x,  meta='int')

x = 2
ddf['should_be_2'] = ddf.map_partitions(func, x,  meta='int')

ddf.compute()

    id  balance should_be_1 should_be_2
0   1   150     1           2
1   1   140     1           2
2   1   130     1           2
3   2   280     1           2
4   2   260     1           2
5   3   220     1           2
6   4   230     1           2
7   1   330     1           2
8   2   420     1           2
9   2   120     1           2
10  1   210     1           2

Answer 1

计算被延迟，但是 dask 会跟踪传递给延迟函数的参数值。稍后更改参数值不会更改传递给较早延迟计算的值：

from dask import delayed

@delayed
def f(x):
    return x

x = 1
a = f(x)

x = 2
b = f(x)

print(dict(a.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 1)}

print(dict(b.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 2)}

dask 在运行 map_partitions 之前如何知道变量状态？

How does dask know variable states before it runs map_partitions?

python

dask

dask-distributed