Odd shape in tensor while training & ResourceExhaustedError: OOM when allocating tensor

Odd shape in tensor while training & ResourceExhaustedError: OOM when allocating tensor

我正在尝试 运行 使用此 github repo 对象检测,它利用了简单的 7 层单发多盒检测器。 我 运行 在 Google Colab 上使用软件包:keras==2.2.4 & tensorflow-gpu==1.13.1

最终我 运行 在训练时遇到了下面的这个错误。我想抱怨的另一件事是导致它崩溃的张量的形状有一个形状 [2,1232,1640,48] 哪里...

Epoch 1/5

---------------------------------------------------------------------------

ResourceExhaustedError                    Traceback (most recent call last)

<ipython-input-28-3fbd9e60a593> in <module>()
     19 
     20                               max_queue_size=1,
---> 21                               workers=0)

7 frames

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419 
   1420     @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215                 outs = model.train_on_batch(x, y,
    216                                             sample_weight=sample_weight,
--> 217                                             class_weight=class_weight)
    218 
    219                 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219 

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1437           ret = tf_session.TF_SessionRunCallable(
   1438               self._session._session, self._handle, args, status,
-> 1439               run_metadata_ptr)
   1440         if run_metadata:
   1441           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

ResourceExhaustedError: OOM when allocating tensor with shape[2,1232,1640,48] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training/Adam/gradients/zeros_22-0-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node loss/add_14}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

请说明发生了什么以及如何解决这个问题。如果这有助于找到错误,我还可以分享更多关于模型结构的相关细节。

如果你有形状

的数据
(2, 1232, 1640, 3)

在通过具有 "SAME" 填充的 42 个过滤器的卷积层后,它将具有形状

(2, 1232, 1640, 42)

而且你的 GPU 上没有这个张量的位置。

我查看了 repo,有一堆层有 48 个过滤器

conv2 = Conv2D(48, (3, 3), strides=(1, 1), padding="same", 
    kernel_initializer='he_normal', 
    kernel_regularizer=l2(l2_reg), name='conv2')(pool1)
conv2 = BatchNormalization(axis=3, momentum=0.99, name='bn2')(conv2)
conv2 = ELU(name='elu2')(conv2)
pool2 = MaxPooling2D(pool_size=(2, 2), name='pool2')(conv2)