PyTorch's autograd issue with joblib
There seems to be a problem mixing PyTorch's autograd with joblib. I need to get gradients in parallel for a lot of samples. Joblib works fine with other parts of PyTorch, but when combined with autograd it gives errors. I made a very small example which shows that the serial version works fine but the parallel version crashes.
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np

torch.autograd.set_detect_anomaly(True)

tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    yi = xi * xi
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)

Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
The error message is not very helpful either:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
The problem is that Parallel uses "loky" as its default backend. If you use "threading" as the backend instead, your code will run as expected; see the Joblib documentation for the Parallel class (Joblib Parallel Class).
So, editing the code you provided into the following:
from joblib import Parallel, delayed
import numpy as np
import torch

torch.autograd.set_detect_anomaly(True)

tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return torch.autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    yi = xi * xi
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)

# The "threading" backend shares memory between workers, so the autograd graph stays visible.
Grads_parallel = Parallel(n_jobs=2, backend="threading")([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
gives the following result:
Grads_serial [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
Grads_parallel [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
I hope this answer helps, have a nice day.
Joblib does not copy the graph associated with the operations over to the other processes. One way to work around this is to perform the computation inside the job itself.
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np

torch.autograd.set_detect_anomaly(False)

tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    # This computes yi inside the job, and thus
    # creates the graph in the worker
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]

torch.set_num_threads(1)

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    # store the operation as a (function, arguments) pair instead of its result
    yi = (lambda xi: xi * xi, [xi])
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)

Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
Edit
A more philosophical question is:
(1) Does it make sense to use joblib parallelism at all if you can simply vectorize your operations and let torch use intra-operator parallelism? (A sketch of this idea follows after this list.)
(2) mak14 mentioned using the threading backend, and it fixes your example nicely. But multiple threads will only use one CPU; that makes sense for IO-bound jobs, such as making HTTP requests, but not for CPU-bound operations.
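As a rough illustration of point (1), assuming all samples can be stacked into a single tensor, the per-sample gradients of y = x * x can be obtained in one vectorized call with no joblib involved:

import torch

# Stack all samples into one leaf tensor and let torch parallelize internally.
xs = torch.rand(10, requires_grad=True)
ys = xs * xs

# Summing is a standard trick: each y_i depends only on x_i, so
# d(sum y)/dx_i equals dy_i/dx_i, giving all per-sample gradients at once.
grads = torch.autograd.grad(ys.sum(), xs, create_graph=True)[0]
print("vectorized grads", grads)  # equals 2 * xs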
Edit #2
The existence of torch.multiprocessing suggests that gradients need some special handling; you could try writing a joblib backend that uses torch.multiprocessing instead of multiprocessing or threading.
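I have not verified how far this helps here, but as a rough sketch one could also bypass joblib entirely, since torch.multiprocessing exposes the same Pool API as the standard library. The grad_of_square helper below is the same hypothetical function as in the earlier sketch.

import torch
import torch.multiprocessing as mp

def grad_of_square(x_value):
    # The graph is built and differentiated inside the worker process;
    # only a plain tensor (no graph attached) is sent back to the parent.
    x = torch.tensor(float(x_value), requires_grad=True)
    y = x * x
    return torch.autograd.grad(y, x)[0]

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(processes=2) as pool:
        grads = pool.map(grad_of_square, [0.1 * i for i in range(10)])
    print("grads", grads)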
Here you can find an overview of how the computational graphs are constructed in the two frameworks:
https://www.tensorflow.org/guide/intro_to_graphs
https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/
But I am afraid that to give a definitive answer as to why one works and the other does not, one would have to study the implementations.