Keras/Theano: Node compilation failed during training

I'm trying to train an already-compiled Keras model on Mac OS X, but I get the following error:

Problem occurred during compilation with the command line below:
/usr/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -march=haswell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -undefined dynamic_lookup -I/usr/local/lib/python2.7/site-packages/numpy/core/include -I/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/lib/python2.7/site-packages/theano/gof -L/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib -fvisibility=hidden -o /Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/c6acccb2fd68eac67ca5b0f0fb9ad9bb.so /Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/mod.cpp
/Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/mod.cpp:894:21: warning: comparison of array 'outputs' equal to a null pointer is always false [-Wtautological-pointer-compare]
                if (outputs == NULL) {
                    ^~~~~~~    ~~~~
/Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/mod.cpp:919:54: error: arithmetic on a pointer to void
                                    PyArray_DATA(V3) + data_offset,
                                    ~~~~~~~~~~~~~~~~ ^
1 warning and 1 error generated.

Traceback (most recent call last):
  File "osr.py", line 359, in <module>
    osr.train_osr_model()
  File "osr.py", line 88, in train_osr_model
    nb_worker=1)
  File "/usr/local/lib/python2.7/site-packages/keras/engine/training.py", line 1454, in fit_generator
    self._make_train_function()
  File "/usr/local/lib/python2.7/site-packages/keras/engine/training.py", line 767, in _make_train_function
    **self._function_kwargs)
  File "/usr/local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 969, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 955, in __init__
    **kwargs)
  File "/usr/local/lib/python2.7/site-packages/theano/compile/function.py", line 326, in function
    output_keys=output_keys)
  File "/usr/local/lib/python2.7/site-packages/theano/compile/pfunc.py", line 486, in pfunc
    output_keys=output_keys)
  File "/usr/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 1795, in orig_function
    defaults)
  File "/usr/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 1661, in create
    input_storage=input_storage_lists, storage_map=storage_map)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/link.py", line 699, in make_thunk
    storage_map=storage_map)[:3]
  File "/usr/local/lib/python2.7/site-packages/theano/gof/vm.py", line 1063, in make_all
    impl=impl))
  File "/usr/local/lib/python2.7/site-packages/theano/gof/op.py", line 924, in make_thunk
    no_recycling)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/op.py", line 828, in make_c_thunk
    output_storage=node_output_storage)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cc.py", line 1190, in make_thunk
    keep_lock=keep_lock)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cc.py", line 1131, in __compile__
    keep_lock=keep_lock)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cc.py", line 1586, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cmodule.py", line 1155, in module_from_key
    module = lnk.compile_cmodule(location)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cc.py", line 1489, in compile_cmodule
    preargs=preargs)
  File "/usr/local/lib/python2.7/site-packages/theano/gof/cmodule.py", line 2304, in compile_str
    (status, compile_stderr.replace('\n', '. ')))
Exception: ('The following error happened while compiling the node', Split{4}(InplaceDimShuffle{1,0,2}.0, TensorConstant{2}, TensorConstant{(4,) of 256}), '\n', "Compilation failed (return status=1): /Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/mod.cpp:894:21: warning: comparison of array 'outputs' equal to a null pointer is always false [-Wtautological-pointer-compare].                 if (outputs == NULL) {.                     ^~~~~~~    ~~~~. /Users/valencra/.theano/compiledir_Darwin-16.4.0-x86_64-i386-64bit-i386-2.7.13-64/tmp9ahb_h/mod.cpp:919:54: error: arithmetic on a pointer to void.                                     PyArray_DATA(V3) + data_offset,.                                     ~~~~~~~~~~~~~~~~ ^. 1 warning and 1 error generated.. ", '[*1 -> Split{4}(<TensorType(float32, 3D)>, TensorConstant{2}, TensorConstant{(4,) of 256}), *1::1, *1::2, *1::3]')

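I updated Keras and Theano, but the problem persists. I'm confused, because training the exact same model worked without this error just a few days ago. Here are the functions used during training: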

def train_osr_model(self):
    """ Train the optical speech recognizer
    """
    print "\nTraining OSR"
    validation_ratio = 0.3
    batch_size = 32
    with h5py.File(self.training_save_fn, "r") as training_save_file:
        sample_count = int(training_save_file.attrs["sample_count"])
        sample_idxs = range(0, sample_count)
        sample_idxs = np.random.permutation(sample_idxs)
        training_sample_idxs = sample_idxs[0:int((1-validation_ratio)*sample_count)]
        validation_sample_idxs = sample_idxs[int((1-validation_ratio)*sample_count):]
        training_sequence_generator = self.generate_training_sequences(batch_size=batch_size, 
                                                                       training_save_file=training_save_file,
                                                                       training_sample_idxs=training_sample_idxs)
        validation_sequence_generator = self.generate_validation_sequences(batch_size=batch_size, 
                                                                           training_save_file=training_save_file,
                                                                           validation_sample_idxs=validation_sample_idxs)

        print "Sample Idxs: {0}\n".format(sample_idxs) # FOR DEBUG ONLY
        print "Training Idxs: {0}\n".format(training_sample_idxs) # FOR DEBUG ONLY
        print "Validation Idxs: {0}\n".format(validation_sample_idxs) # FOR DEBUG ONLY

        pbi = ProgressDisplay()
        self.osr.fit_generator(generator=training_sequence_generator,
                               validation_data=validation_sequence_generator,
                               samples_per_epoch=len(training_sample_idxs),
                               nb_val_samples=len(validation_sample_idxs),
                               nb_epoch=10,
                               max_q_size=1,
                               verbose=2,
                               callbacks=[pbi],
                               class_weight=None,
                               nb_worker=1)

def generate_training_sequences(self, batch_size, training_save_file, training_sample_idxs):
    """ Generates training sequences from HDF5 file on demand
    """
    while True:
        # generate sequences for training
        training_sample_count = len(training_sample_idxs)
        batches = int(training_sample_count/batch_size)
        remainder_samples = training_sample_count%batch_size
        if remainder_samples:
            batches = batches + 1
        # generate batches of samples
        for idx in xrange(0, batches):
            if idx == batches - 1:
                batch_idxs = training_sample_idxs[idx*batch_size:]
            else:
                batch_idxs = training_sample_idxs[idx*batch_size:idx*batch_size+batch_size]

            print batch_idxs # FOR DEBUG ONLY

            X = training_save_file["X"][batch_idxs]
            Y = training_save_file["Y"][batch_idxs]

            yield (np.array(X), np.array(Y))

def generate_validation_sequences(self, batch_size, training_save_file, validation_sample_idxs):
    while True:
        # generate sequences for validation
        validation_sample_count = len(validation_sample_idxs)
        batches = int(validation_sample_count/batch_size)
        remainder_samples = validation_sample_count%batch_size
        if remainder_samples:
            batches = batches + 1
        # generate batches of samples
        for idx in xrange(0, batches):
            if idx == batches - 1:
                batch_idxs = validation_sample_idxs[idx*batch_size:]
            else:
                batch_idxs = validation_sample_idxs[idx*batch_size:idx*batch_size+batch_size]

            print batch_idxs # FOR DEBUG ONLY

            X = training_save_file["X"][batch_idxs]
            Y = training_save_file["Y"][batch_idxs]

            yield (np.array(X), np.array(Y))

For reference, this is the model being trained:

def generate_osr_model(self):
    """ Builds the optical speech recognizer model
    """
    print "".join(["\nGenerating OSR model\n",
                   "-"*40])
    with h5py.File(self.training_save_fn, "r") as training_save_file:
        class_count = len(training_save_file.attrs["training_classes"].split(","))
    video = Input(shape=(self.frames_per_sequence,
                         3,
                         self.rows,
                         self.columns))
    cnn_base = VGG16(input_shape=(3,
                                  self.rows, 
                                  self.columns),
                     weights="imagenet",
                     include_top=False)
    cnn_out = GlobalAveragePooling2D()(cnn_base.output)
    cnn = Model(input=cnn_base.input, output=cnn_out)
    cnn.trainable = False
    encoded_frames = TimeDistributed(cnn)(video)
    encoded_vid = LSTM(256)(encoded_frames)
    hidden_layer = Dense(output_dim=1024, activation="relu")(encoded_vid)
    outputs = Dense(output_dim=class_count, activation="softmax")(hidden_layer)
    osr = Model([video], outputs)
    optimizer = Nadam(lr=0.002,
                      beta_1=0.9,
                      beta_2=0.999,
                      epsilon=1e-08,
                      schedule_decay=0.004)
    osr.compile(loss="categorical_crossentropy",
                optimizer=optimizer,
                metrics=["categorical_accuracy"])
    self.osr = osr
    print " * OSR MODEL GENERATED * "

Model summary:

Generating OSR model
----------------------------------------
 * OSR MODEL GENERATED *

*** MODEL SUMMARY ***
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_1 (InputLayer)             (None, 30, 3, 100, 15 0
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribut (None, 30, 512)       14714688    input_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 256)           787456      timedistributed_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1024)          263168      lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 3)             3075        dense_1[0][0]
====================================================================================================
Total params: 15,768,387
Trainable params: 1,053,699
Non-trainable params: 14,714,688

The problem seems to stem from installing Theano and Keras from their GitHub repositories, like this:

pip install git+git://github.com/Theano/Theano.git
pip install git+git://github.com/fchollet/keras.git

I fixed it by uninstalling Theano and Keras and then installing them directly with pip:

pip uninstall Theano
pip uninstall keras
pip install Theano
pip install keras

There may be a problem with the bleeding-edge version of Theano or Keras. Hope this helps others as well.
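To see which builds are actually installed before and after reinstalling, something like the following rough sketch may help. The helper names `installed_version` and `looks_like_dev_build` are hypothetical, and the "dev"/"+" check is only a heuristic assumption about how development builds label themselves. A git install whose version string looks like a normal release will not be flagged, so a "release" result is not proof.

```python
from __future__ import print_function


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        # Available on Python 3.8 and later.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Older interpreters (such as Python 2.7): fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def looks_like_dev_build(version_string):
    """Heuristic only (an assumption): look for a 'dev' or '+local' marker."""
    return "dev" in version_string or "+" in version_string


for name in ("Theano", "Keras"):
    found = installed_version(name)
    if found is None:
        print("{0}: not installed".format(name))
    else:
        kind = "possible development build" if looks_like_dev_build(found) else "looks like a release"
        print("{0}: {1} ({2})".format(name, found, kind))
```

Running it once with the git installs and once after the pip reinstall should show whether the version strings actually changed.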

Edit: It looks like this problem does come from Theano's master branch. For a potential permanent fix, follow the issue I opened on the Theano repository: https://github.com/Theano/Theano/issues/5655