Cannot install Keras MXNet without CUDA support on a machine with GPUs

I am explicitly trying to install a version of mxnet without CUDA support. With the CUDA-enabled build installed, I can run this example here. I am following the keras & mxnet installation guide here.

Steps to reproduce the working, CUDA-enabled keras-mxnet setup:

Here is my GPU configuration, from nvcc --version:

~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

Make sure mxnet is not already installed, then run:

pip install mxnet-cu80
pip install keras-mxnet

Running the code in Jupyter gives me:

60000 train samples
10000 test samples
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               401920    
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_2 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                5130      
_________________________________________________________________
activation_3 (Activation)    (None, 10)                0         
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/20
 6400/60000 [==>...........................] - ETA: 39s - loss: 2.1718 - acc: 0.2587 
/usr/local/lib/python3.6/dist-packages/mxnet/module/bucketing_module.py:408: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  force_init=force_init)
60000/60000 [==============================] - 6s 103us/step - loss: 1.2105 - acc: 0.6957 - val_loss: 0.5334 - val_acc: 0.8728
Epoch 2/20
60000/60000 [==============================] - 2s 27us/step - loss: 0.5280 - acc: 0.8515 - val_loss: 0.3749 - val_acc: 0.8996
Epoch 3/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.4239 - acc: 0.8786 - val_loss: 0.3213 - val_acc: 0.9098
Epoch 4/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3740 - acc: 0.8911 - val_loss: 0.2923 - val_acc: 0.9162
Epoch 5/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3437 - acc: 0.9008 - val_loss: 0.2704 - val_acc: 0.9218
Epoch 6/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.3195 - acc: 0.9079 - val_loss: 0.2539 - val_acc: 0.9263
Epoch 7/20
60000/60000 [==============================] - 2s 29us/step - loss: 0.2965 - acc: 0.9151 - val_loss: 0.2393 - val_acc: 0.9312
Epoch 8/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2792 - acc: 0.9190 - val_loss: 0.2264 - val_acc: 0.9342
Epoch 9/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2641 - acc: 0.9239 - val_loss: 0.2173 - val_acc: 0.9363
Epoch 10/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2520 - acc: 0.9277 - val_loss: 0.2064 - val_acc: 0.9413
Epoch 11/20
60000/60000 [==============================] - 2s 29us/step - loss: 0.2409 - acc: 0.9306 - val_loss: 0.1983 - val_acc: 0.9425
Epoch 12/20
60000/60000 [==============================] - 2s 30us/step - loss: 0.2307 - acc: 0.9331 - val_loss: 0.1894 - val_acc: 0.9447
Epoch 13/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2209 - acc: 0.9362 - val_loss: 0.1813 - val_acc: 0.9463
Epoch 14/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2106 - acc: 0.9396 - val_loss: 0.1756 - val_acc: 0.9478
Epoch 15/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.2044 - acc: 0.9410 - val_loss: 0.1687 - val_acc: 0.9501
Epoch 16/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1963 - acc: 0.9424 - val_loss: 0.1625 - val_acc: 0.9528
Epoch 17/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1912 - acc: 0.9436 - val_loss: 0.1576 - val_acc: 0.9542
Epoch 18/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1842 - acc: 0.9472 - val_loss: 0.1544 - val_acc: 0.9541
Epoch 19/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1782 - acc: 0.9482 - val_loss: 0.1490 - val_acc: 0.9553
Epoch 20/20
60000/60000 [==============================] - 2s 28us/step - loss: 0.1729 - acc: 0.9494 - val_loss: 0.1447 - val_acc: 0.9570
Test score: 0.144698123593
Test accuracy: 0.957

Steps to reproduce the failing CPU-only keras-mxnet setup:

Same as before, but install mxnet instead of mxnet-cu80:

pip uninstall mxnet-cu80
pip install mxnet
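
To confirm which MXNet wheel is actually installed after the swap, the matching distributions can be listed from Python. This is a minimal sketch, assuming Python 3.8+ for importlib.metadata (the traceback in this post shows Python 3.6, where pip list gives the same information). The helper name mxnet_distributions is made up for illustration.

```python
from importlib import metadata


def mxnet_distributions(prefix="mxnet"):
    """Return sorted (name, version) pairs for installed distributions
    whose name starts with `prefix` (e.g. mxnet, mxnet-cu80)."""
    found = []
    for dist in metadata.distributions():
        # Some broken installs have no Name field, so fall back to "".
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith(prefix.lower()):
            found.append((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    # Ideally exactly one entry is printed; both mxnet and mxnet-cu80
    # provide the same importable `mxnet` package.
    for name, version in mxnet_distributions():
        print(name, version)
```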

Running the code in the Jupyter notebook now gives me:

60000 train samples
10000 test samples
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 512)               401920    
_________________________________________________________________
activation_4 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_5 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 10)                5130      
_________________________________________________________________
activation_6 (Activation)    (None, 10)                0         
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/20
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/mxnet/symbol/symbol.py in simple_bind(self, ctx, grad_req, type_dict, stype_dict, group2ctx, shared_arg_names, shared_exec, shared_buffer, **kwargs)
   1512                                                  shared_exec_handle,
-> 1513                                                  ctypes.byref(exe_handle)))
   1514         except MXNetError as e:

/usr/local/lib/python3.6/dist-packages/mxnet/base.py in check_call(ret)
    148     if ret != 0:
--> 149         raise MXNetError(py_str(_LIB.MXGetLastError()))
    150 

MXNetError: [04:19:54] src/storage/storage.cc:123: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c05f2) [0x7f737ac845f2]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c0bd8) [0x7f737ac84bd8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d7d3cd) [0x7f737d8413cd]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d8141d) [0x7f737d84541d]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d83206) [0x7f737d847206]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2831) [0x7f737d266831]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2984) [0x7f737d266984]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27aecec) [0x7f737d272cec]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27b55f8) [0x7f737d2795f8]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27c163a) [0x7f737d28563a]



During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-c71d8965f0f3> in <module>()
     49 history = model.fit(X_train, Y_train,
     50                     batch_size=batch_size, epochs=nb_epoch,
---> 51                     verbose=1, validation_data=(X_test, Y_test))
     52 score = model.evaluate(X_test, Y_test, verbose=0)
     53 print('Test score:', score[0])

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1042                                         initial_epoch=initial_epoch,
   1043                                         steps_per_epoch=steps_per_epoch,
-> 1044                                         validation_steps=validation_steps)
   1045 
   1046     def evaluate(self, x=None, y=None,

/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 if not isinstance(outs, list):
    201                     outs = [outs]

/usr/local/lib/python3.6/dist-packages/keras/backend/mxnet_backend.py in train_function(inputs)
   4794             def train_function(inputs):
   4795                 self._check_trainable_weights_consistency()
-> 4796                 data, label, _, data_shapes, label_shapes = self._adjust_module(inputs, 'train')
   4797 
   4798                 batch = mx.io.DataBatch(data=data, label=label, bucket_key='train',

/usr/local/lib/python3.6/dist-packages/keras/backend/mxnet_backend.py in _adjust_module(self, inputs, phase)
   4746                     self._set_weights()
   4747                 else:
-> 4748                     self._module.bind(data_shapes=data_shapes, label_shapes=None, for_training=True)
   4749                     self._set_weights()
   4750                     self._module.init_optimizer(kvstore=self._kvstore, optimizer=self.optimizer)

/usr/local/lib/python3.6/dist-packages/mxnet/module/bucketing_module.py in bind(self, data_shapes, label_shapes, for_training, inputs_need_grad, force_rebind, shared_module, grad_req)
    341                         compression_params=self._compression_params)
    342         module.bind(data_shapes, label_shapes, for_training, inputs_need_grad,
--> 343                     force_rebind=False, shared_module=None, grad_req=grad_req)
    344         self._curr_module = module
    345         self._curr_bucket_key = self._default_bucket_key

/usr/local/lib/python3.6/dist-packages/mxnet/module/module.py in bind(self, data_shapes, label_shapes, for_training, inputs_need_grad, force_rebind, shared_module, grad_req)
    428                                                      fixed_param_names=self._fixed_param_names,
    429                                                      grad_req=grad_req, group2ctxs=self._group2ctxs,
--> 430                                                      state_names=self._state_names)
    431         self._total_exec_bytes = self._exec_group._total_exec_bytes
    432         if shared_module is not None:

/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in __init__(self, symbol, contexts, workload, data_shapes, label_shapes, param_names, for_training, inputs_need_grad, shared_group, logger, fixed_param_names, grad_req, state_names, group2ctxs)
    263         self.num_outputs = len(self.symbol.list_outputs())
    264 
--> 265         self.bind_exec(data_shapes, label_shapes, shared_group)
    266 
    267     def decide_slices(self, data_shapes):

/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in bind_exec(self, data_shapes, label_shapes, shared_group, reshape)
    359             else:
    360                 self.execs.append(self._bind_ith_exec(i, data_shapes_i, label_shapes_i,
--> 361                                                       shared_group))
    362 
    363         self.data_shapes = data_shapes

/usr/local/lib/python3.6/dist-packages/mxnet/module/executor_group.py in _bind_ith_exec(self, i, data_shapes, label_shapes, shared_group)
    637                                            type_dict=input_types, shared_arg_names=self.param_names,
    638                                            shared_exec=shared_exec, group2ctx=group2ctx,
--> 639                                            shared_buffer=shared_data_arrays, **input_shapes)
    640         self._total_exec_bytes += int(executor.debug_str().split('\n')[-3].split()[1])
    641         return executor

/usr/local/lib/python3.6/dist-packages/mxnet/symbol/symbol.py in simple_bind(self, ctx, grad_req, type_dict, stype_dict, group2ctx, shared_arg_names, shared_exec, shared_buffer, **kwargs)
   1517                 error_msg += "%s: %s\n" % (k, v)
   1518             error_msg += "%s" % e
-> 1519             raise RuntimeError(error_msg)
   1520 
   1521         # update shared_buffer

RuntimeError: simple_bind error. Arguments:
/dense_4_input1: (128, 784)
[04:19:54] src/storage/storage.cc:123: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c05f2) [0x7f737ac845f2]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x1c0bd8) [0x7f737ac84bd8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d7d3cd) [0x7f737d8413cd]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d8141d) [0x7f737d84541d]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2d83206) [0x7f737d847206]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2831) [0x7f737d266831]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27a2984) [0x7f737d266984]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27aecec) [0x7f737d272cec]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27b55f8) [0x7f737d2795f8]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x27c163a) [0x7f737d28563a]

What exactly does this error mean, and how can I fix it?

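This happens because model.compile picks the CPU or the GPU depending on whether the machine has a GPU. It does not appear to check whether the GPU build of MXNet is installed. You can force model.compile to use the CPU by specifying the context explicitly. For example: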

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'],
              context=["cpu()"])
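
If the same notebook has to run with either MXNet build, the context can be chosen at runtime instead of being hard-coded. The sketch below is only one possible approach: it tries a tiny allocation on the GPU and falls back to the CPU when that fails. The names pick_context and _mxnet_gpu_probe are hypothetical helpers, and the "gpu(0)" string is assumed by analogy with the "cpu()" string above.

```python
def _mxnet_gpu_probe():
    """Try to allocate a one-element array on GPU 0.

    A CPU-only MXNet build is expected to raise here, with the same
    "Compile with USE_CUDA=1" message shown in the question.
    """
    import mxnet as mx  # imported lazily so the helper loads without MXNet

    mx.nd.zeros((1,), ctx=mx.gpu(0)).wait_to_read()


def pick_context(gpu_probe=None):
    """Return a context list for model.compile.

    Returns ["gpu(0)"] when the probe succeeds and ["cpu()"] otherwise.
    `gpu_probe` is any zero-argument callable that raises when the GPU
    cannot be used; it defaults to the MXNet allocation probe above.
    """
    probe = gpu_probe if gpu_probe is not None else _mxnet_gpu_probe
    try:
        probe()
    except Exception:
        # Any failure (no CUDA build, no device, MXNet missing) means CPU.
        return ["cpu()"]
    return ["gpu(0)"]


# Usage with the compile call above (assumes keras-mxnet is installed):
# model.compile(loss='categorical_crossentropy',
#               optimizer=RMSprop(),
#               metrics=['accuracy'],
#               context=pick_context())
```

Catching a broad Exception keeps the fallback simple, but it also hides unrelated GPU failures, so printing the chosen context once at startup may be worthwhile.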