dask 在运行 map_partitions 之前如何知道变量状态?
How does dask know variable states before it runs map_partitions?
在下面的 dask 代码中,我在执行两个 map_partitions
之前将 x 设置为 1 和 2。结果好像还不错,就是没看懂。
如果 dask 只在找到 compute()
时才等待 运行 这两个 map_partitions
,而此时它发现 compute()
x 是 2,那该多好啊知道第一个 map_partitions
?
中的 x = 1
pdf = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 3, 4, 1, 2, 2, 1],
'balance': [150, 140, 130, 280, 260, 220, 230, 330, 420, 120, 210]
})
ddf = dd.from_pandas(pdf, npartitions=2)
def func(df, a):
return a
x = 1
ddf['should_be_1'] = ddf.map_partitions(func, x, meta='int')
x = 2
ddf['should_be_2'] = ddf.map_partitions(func, x, meta='int')
ddf.compute()
id balance should_be_1 should_be_2
0 1 150 1 2
1 1 140 1 2
2 1 130 1 2
3 2 280 1 2
4 2 260 1 2
5 3 220 1 2
6 4 230 1 2
7 1 330 1 2
8 2 420 1 2
9 2 120 1 2
10 1 210 1 2
计算被延迟,但是 dask 会跟踪传递给延迟函数的参数值。稍后更改参数值不会更改传递给较早延迟计算的值:
from dask import delayed
@delayed
def f(x):
return x
x = 1
a = f(x)
x = 2
b = f(x)
print(dict(a.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 1)}
print(dict(b.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 2)}
在下面的 dask 代码中,我在执行两个 map_partitions
之前将 x 设置为 1 和 2。结果好像还不错,就是没看懂。
如果 dask 只在找到 compute()
时才等待 运行 这两个 map_partitions
,而此时它发现 compute()
x 是 2,那该多好啊知道第一个 map_partitions
?
pdf = pd.DataFrame({
'id': [1, 1, 1, 2, 2, 3, 4, 1, 2, 2, 1],
'balance': [150, 140, 130, 280, 260, 220, 230, 330, 420, 120, 210]
})
ddf = dd.from_pandas(pdf, npartitions=2)
def func(df, a):
return a
x = 1
ddf['should_be_1'] = ddf.map_partitions(func, x, meta='int')
x = 2
ddf['should_be_2'] = ddf.map_partitions(func, x, meta='int')
ddf.compute()
id balance should_be_1 should_be_2
0 1 150 1 2
1 1 140 1 2
2 1 130 1 2
3 2 280 1 2
4 2 260 1 2
5 3 220 1 2
6 4 230 1 2
7 1 330 1 2
8 2 420 1 2
9 2 120 1 2
10 1 210 1 2
计算被延迟,但是 dask 会跟踪传递给延迟函数的参数值。稍后更改参数值不会更改传递给较早延迟计算的值:
from dask import delayed
@delayed
def f(x):
return x
x = 1
a = f(x)
x = 2
b = f(x)
print(dict(a.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 1)}
print(dict(b.dask))
# {'some_hash': (<function f at 0x7fab1b72c550>, 2)}