Theano: cublasSgemm failed (14) an internal operation failed

Question

Sometimes, after running for a while, I get this error from Theano / CUDA:

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[512 2048], a.dim=[512 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(512, 493), (493, 2048)]
Inputs strides: [(493, 1), (2048, 1)]
Inputs values: ['not shown', 'not shown']

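Since my code runs fine for a while (I do neural network training; most of the time it runs through to the end, and even when this error occurred it had already run fine for more than 2000 mini-batches), I wonder what the cause is. Maybe some hardware fault?

This is with CUDA 6.0 and a very recent Theano (yesterday's version from Git), Ubuntu 12.04, GTX 580.

I also got the error with CUDA 6.5 on a K20: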

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[2899 2000], a.dim=[2899 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2899, 493), (493, 2000)]
Inputs strides: [(493, 1), (2000, 1)]
Inputs values: ['not shown', 'not shown']

(Another error that I sometimes got in the past is this. I'm not sure whether it is related.)

An addition from Markus, who got the same error:

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[2 100], a.dim=[2 9919], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuFlatten{2}.0, weight_hidden_)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2, 9919), (9919, 100)]
Inputs strides: [(9919, 1), (100, 1)]
Inputs values: ['not shown', 'not shown']

Using CUDA 6.5, Windows 8.1, Python 2.7, GTX 970M.

The error only occurs in my own network; if I run the LeNet example from Theano, it runs fine. The network also compiles and runs fine on the CPU (and on the GPU for some colleagues using Linux). Does anyone have an idea what the problem could be?

Answer 1

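Just for reference, in case anyone stumbles upon this:

This doesn't happen for me anymore. I'm not sure what fixed it, but I think the main difference is that I now avoid any multithreading and any forking (without exec). These had caused many similar problems, e.g. "Theano CUDA error: an illegal memory access was encountered" (Stack Overflow) and "Theano CUDA error: an illegal memory access was encountered" (Google Groups discussion). The discussion on Google Groups in particular was very helpful.

Theano functions are not thread-safe. That should not have been a problem for me, because I only use Theano from one thread. However, I still think that other threads might cause these problems. Maybe it is related to Python's GC, which frees some Cuda_Ndarray in another thread while a theano.function is running.

I looked a bit at the relevant Theano code, and I'm not sure whether it covers all such cases.

Note that you might not even be aware that you have background threads. Some Python stdlib code can spawn such background threads; multiprocessing.Queue, for example, does that.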
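To see whether a process has such hidden threads, one can list the live threads before and after touching the queue. The following is only a minimal sketch, assuming CPython's standard multiprocessing module; the helper names thread_names and demo_queue_feeder_thread are made up for this illustration. In CPython the queue starts its feeder thread lazily, on the first put().

```python
import multiprocessing
import threading


def thread_names():
    """Return the names of all threads currently alive in this process."""
    return sorted(t.name for t in threading.enumerate())


def demo_queue_feeder_thread():
    """Show that a multiprocessing.Queue starts a background thread."""
    before = thread_names()

    q = multiprocessing.Queue()
    q.put("hello")  # the first put() starts the queue's feeder thread

    after = thread_names()

    item = q.get(timeout=5)  # the item travels through the feeder thread
    q.close()
    q.join_thread()  # wait for the feeder thread to finish
    return before, after, item


if __name__ == "__main__":
    before, after, item = demo_queue_feeder_thread()
    print("threads before put():", before)
    print("threads after put(): ", after)
```

Calling thread_names() right before compiling or running a Theano function is a cheap way to check whether the "single thread" assumption actually holds.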

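I cannot avoid having multiple threads, so until this is fixed in Theano, I create a new subprocess with only a single thread in which I do all the Theano work. This also has several advantages: a cleaner separation of the code, being faster in some cases because it really runs in parallel, and being able to use multiple GPUs.

Note that just using the multiprocessing module did not work well for me, because some libraries (Numpy and others, maybe even Theano itself) can behave badly in a forked process (depending on the versions, the OS and race conditions). Thus, I needed a real subprocess (fork + exec, not just fork).

My code is here, in case anyone is interested in this.

There is ExecingProcess, which is modeled after multiprocessing.Process but does a fork + exec. (By the way, on Windows the multiprocessing module does this anyway, because there is no fork on Windows.) And there is AsyncTask, which adds a duplex pipe on top of this and works with both ExecingProcess and the standard multiprocessing.Process.

See also: Theano Wiki: Using multiple GPUs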
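The general idea of a real subprocess (a fresh interpreter started via fork + exec, talking to the parent over pipes) can be sketched with the standard subprocess module. This is not the code referred to in this answer, only an illustration under stated assumptions: the WorkerProcess class, the JSON-lines protocol and the sum-of-squares placeholder are all hypothetical, and a real worker would import Theano and call its compiled functions where the placeholder is.

```python
import json
import subprocess
import sys

# Source of the child program. It runs in a fresh interpreter, so it has a
# single thread and inherits no state from the parent's threads.
WORKER_SOURCE = r"""
import json, sys
# A real worker would "import theano" here, so that the CUDA context
# only ever exists in this process (assumption for illustration).
for line in sys.stdin:
    request = json.loads(line)
    # Placeholder for calling a compiled Theano function.
    result = sum(x * x for x in request["values"])
    sys.stdout.write(json.dumps({"result": result}) + "\n")
    sys.stdout.flush()
"""


class WorkerProcess:
    """Hypothetical wrapper around a child interpreter (fork + exec)."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,  # text mode pipes
            bufsize=1,                # line buffered
        )

    def call(self, values):
        """Send one request line and block until the reply line arrives."""
        self.proc.stdin.write(json.dumps({"values": values}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())["result"]

    def close(self):
        """Close the child's stdin so its loop ends, then return its exit code."""
        self.proc.stdin.close()
        code = self.proc.wait()
        self.proc.stdout.close()
        return code
```

The parent can keep as many threads as it likes; only the child touches the GPU. On Python 3.4 and later, multiprocessing.get_context("spawn") is another way to get a freshly started interpreter instead of a plain fork.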

Answer 2

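I ran into a similar issue and, FWIW, in my case it was solved by eliminating the import of another library that used pycuda. It appears Theano really does not like to share.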
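A quick way to check whether some dependency has already pulled in pycuda is to look at sys.modules before importing Theano. This is only a diagnostic sketch; the function name loaded_cuda_libraries and its default candidate list are assumptions for illustration, and finding pycuda loaded does not by itself prove that it is the cause.

```python
import sys


def loaded_cuda_libraries(candidates=("pycuda",)):
    """Return those candidate packages that are already imported."""
    loaded = []
    for name in candidates:
        prefix = name + "."
        # Match the package itself or any of its submodules.
        if name in sys.modules or any(
            mod.startswith(prefix) for mod in list(sys.modules)
        ):
            loaded.append(name)
    return loaded


if __name__ == "__main__":
    print("CUDA-related packages already imported:", loaded_cuda_libraries())
```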

cuda

cublas

theano