TensorFlow: InternalError: Blas SGEMM launch failed

When I run sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) I get InternalError: Blas SGEMM launch failed. Here is the full error and stack trace:

InternalErrorTraceback (most recent call last)
<ipython-input-9-a3261a02bdce> in <module>()
      1 batch_xs, batch_ys = mnist.train.next_batch(100)
----> 2 sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    338     try:
    339       result = self._run(None, fetches, feed_dict, options_ptr,
--> 340                          run_metadata_ptr)
    341       if run_metadata:
    342         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
    562     try:
    563       results = self._do_run(handle, target_list, unique_fetches,
--> 564                              feed_dict_string, options, run_metadata)
    565     finally:
    566       # The movers are no longer used. Delete them.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
    635     if handle is None:
    636       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 637                            target_list, options, run_metadata)
    638     else:
    639       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
    657       # pylint: disable=protected-access
    658       raise errors._make_specific_exception(node_def, op, error_message,
--> 659                                             e.code)
    660       # pylint: enable=protected-access
    661 

InternalError: Blas SGEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_4, Variable/read)]]
Caused by op u'MatMul', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 596, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 442, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 162, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 883, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 391, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 199, in do_execute
    shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2723, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2825, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d7414c4b6213>", line 4, in <module>
    y = tf.nn.softmax(tf.matmul(x, W) + b)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1036, in matmul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 911, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
    self._traceback = _extract_stack()

Stack: EC2 g2.8xlarge machine, Ubuntu 14.04

I ran into this error when running TensorFlow Distributed. Did you check whether any of the workers are reporting CUDA_OUT_OF_MEMORY errors? If that is the case, it may have to do with where you place your weight and bias variables. For example:

with tf.device("/job:paramserver/task:0/cpu:0"):
    W = weight_variable([input_units, num_hidden_units])
    b = bias_variable([num_hidden_units])
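
The weight_variable and bias_variable helpers above are not defined in the answer; a minimal sketch of how they might look (my assumption, in the usual TensorFlow MNIST-tutorial style) is:

import tensorflow as tf

def weight_variable(shape):
    # Assumed helper: small random initial weights, TF MNIST-tutorial style.
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    # Assumed helper: small constant initial bias.
    return tf.Variable(tf.constant(0.1, shape=shape))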

Old question, but it may help others.
Try closing interactive sessions that are active in other processes (if you use an IPython Notebook, just restart the kernel). This helped me!

Additionally, I use this code to close the local session in the current kernel during experiments:

if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()

My environment is Python 3.5, TensorFlow 0.12 and Windows 10 (no Docker). I was training neural networks on both the CPU and the GPU. I ran into the same error, InternalError: Blas SGEMM launch failed, whenever I trained on the GPU.

I could not find the reason why this error happens, but I managed to run my code on the GPU by avoiding the TensorFlow function tensorflow.contrib.slim.one_hot_encoding(). Instead, I do the one-hot encoding operation in numpy (on the input and output variables).

The following code reproduces the error and the fix. It is a minimal setup to learn the y = x ** 2 function using gradient descent.

import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim

def test_one_hot_encoding_using_tf():

    # This function raises the "InternalError: Blas SGEMM launch failed" when run in the GPU

    # Initialize
    tf.reset_default_graph()
    input_size = 10
    output_size = 100
    input_holder = tf.placeholder(shape=[1], dtype=tf.int32, name='input')
    output_holder = tf.placeholder(shape=[1], dtype=tf.int32, name='output')

    # Define network
    input_oh = slim.one_hot_encoding(input_holder, input_size)
    output_oh = slim.one_hot_encoding(output_holder, output_size)
    W1 = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    output_v = tf.matmul(input_oh, W1)
    output_v = tf.reshape(output_v, [-1])

    # Define updates
    loss = tf.reduce_sum(tf.square(output_oh - output_v))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)

    # Optimize
    init = tf.initialize_all_variables()
    steps = 1000

    # Force CPU/GPU
    config = tf.ConfigProto(
        # device_count={'GPU': 0}  # uncomment this line to force CPU
    )

    # Launch the tensorflow graph
    with tf.Session(config=config) as sess:
        sess.run(init)

        for step_i in range(steps):

            # Get sample
            x = np.random.randint(0, 10)
            y = np.power(x, 2).astype('int32')

            # Update
            _, l = sess.run([update_model, loss], feed_dict={input_holder: [x], output_holder: [y]})

        # Check model
        print('Final loss: %f' % l)

def test_one_hot_encoding_no_tf():

    # This function does not raise the "InternalError: Blas SGEMM launch failed" when run in the GPU

    def oh_encoding(label, num_classes):
        return np.identity(num_classes)[label:label + 1].astype('int32')

    # Initialize
    tf.reset_default_graph()
    input_size = 10
    output_size = 100
    input_holder = tf.placeholder(shape=[1, input_size], dtype=tf.float32, name='input')
    output_holder = tf.placeholder(shape=[1, output_size], dtype=tf.float32, name='output')

    # Define network
    W1 = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    output_v = tf.matmul(input_holder, W1)
    output_v = tf.reshape(output_v, [-1])

    # Define updates
    loss = tf.reduce_sum(tf.square(output_holder - output_v))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)

    # Optimize
    init = tf.initialize_all_variables()
    steps = 1000

    # Force CPU/GPU
    config = tf.ConfigProto(
        # device_count={'GPU': 0}  # uncomment this line to force CPU
    )

    # Launch the tensorflow graph
    with tf.Session(config=config) as sess:
        sess.run(init)

        for step_i in range(steps):

            # Get sample
            x = np.random.randint(0, 10)
            y = np.power(x, 2).astype('int32')

            # One hot encoding
            x = oh_encoding(x, 10)
            y = oh_encoding(y, 100)

            # Update
            _, l = sess.run([update_model, loss], feed_dict={input_holder: x, output_holder: y})

        # Check model
        print('Final loss: %f' % l)
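
A hypothetical way to run the two variants (not part of the original answer): the numpy-based version trains on the GPU without the error, while the slim-based version reproduces it on the affected setup.

if __name__ == '__main__':
    test_one_hot_encoding_no_tf()        # works on the GPU
    # test_one_hot_encoding_using_tf()   # reproduces the InternalError on the GPU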

I ran into this problem and solved it by setting allow_soft_placement=True and gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3), which specifically defines the fraction of GPU memory the process may use. I guess this helps to avoid two TensorFlow processes competing for GPU memory.

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
sess = tf.Session(config=tf.ConfigProto(
  gpu_options=gpu_options, allow_soft_placement=True, log_device_placement=True))
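
A related option, mentioned here as an assumption rather than as part of this answer: instead of reserving a fixed fraction, you can let TensorFlow grow its GPU memory usage on demand with allow_growth.

gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,
                                        allow_soft_placement=True))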

Maybe you have not released your GPU correctly. If you are on Linux, try ps -ef | grep python to see which jobs are using the GPU, then kill them.

In my case, the network file system on which libcublas.so was located had died. The node was rebooted and everything was fine. Just to add another data point.

In my case, I had two Python consoles open, both using keras/tensorflow. When I closed the old console (forgotten from the day before), everything started working properly.

So it is good to check whether you have multiple consoles or processes occupying the GPU.

I closed all other Jupyter sessions that were running, and this solved the problem. I believe it was a GPU memory issue.

I ran into this error when running Keras CuDNN tests in parallel with pytest-xdist. The solution was to run them serially.

For me, I got this error when using Keras with TensorFlow as the backend. It was because the deep learning environment in Anaconda was not activated properly, so TensorFlow did not start properly either. I noticed this because, since the last time I activated my deep learning environment (called dl), the prompt in my Anaconda Prompt had changed to:

(dl) C:\Users\georg\Anaconda3\envs\dl\etc\conda\activate.d>set "KERAS_BACKEND=tensorflow"

whereas before it was just dl. Therefore, what I did to get rid of the above error was to close my Jupyter notebook and the Anaconda Prompt and relaunch them, several times.

I encountered this error recently after changing my OS to Windows 10; I never had this problem before when using Windows 7.

The error occurs if I load my GPU TensorFlow model while another GPU program is running; in my case a JCuda model loaded as a socket server, which is not big. If I close my other GPU program(s), this TensorFlow model loads successfully.

This JCuda program is not big at all, only around 70 MB, while this TensorFlow model is more than 500 MB, much bigger. But I am using a 1080 Ti, which has plenty of memory. So it is probably not an out-of-memory problem; it may be some tricky internal issue of TensorFlow regarding the OS or CUDA. (PS: I am using CUDA version 8.0.44 and have not downloaded a newer version.)

Restarting my Jupyter processes was not enough; I had to reboot my computer.

For me, I ran into this problem when I tried to run multiple TensorFlow processes (e.g. 2) that both required access to GPU resources.

A simple solution is to make sure only one TensorFlow process is running at a time.

For more details, see here.
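
If you do need several processes, a minimal sketch (my assumption, not from this answer) is to pin each process to its own GPU via CUDA_VISIBLE_DEVICES before importing TensorFlow, so they do not contend for the same device:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # e.g. '1' in the second process
import tensorflow as tf  # import after setting the environment variable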

To be clear, tensorflow will try (by default) to consume all available GPUs. It cannot be run with other programs also active. Closing. Feel free to reopen if this is actually another problem.

In my case, I first ran

conda clean --all

to clean up tarballs and unused packages.

Then I restarted my IDE (PyCharm, in this case) and it worked well. Environment: Anaconda Python 3.6, Windows 10 64-bit. I installed tensorflow-gpu via the command provided on the Anaconda website.

In my case, opening the Jupyter notebooks in separate servers was enough.

This error only occurs for me if I try to use more than one tensorflow/keras model on the same server. It does not matter whether I open one notebook, execute it, close it, and then try to open another: if they are loaded into the same Jupyter server, the error always happens.

TensorFlow 2.0-compatible answer: Providing the 2.0 code for erko's answer, for the benefit of the community.

session = tf.compat.v1.Session()

if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()