Keras BatchNormalization layer : InternalError: cuDNN launch failure
The BatchNormalization layer of my Keras model (with the TensorFlow backend) does not work, raising an InternalError exception at training time.
Here is the line that defines the BatchNormalization layer in my model:
bn = BatchNormalization(axis=3)(grid)
To debug the model, I created two models (one cut just before this layer and one just after):
debug = Model(inputs=[question1, question2], outputs=grid)
debug.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
bn = BatchNormalization(axis=3)(grid)
debug2 = Model(inputs=[question1, question2], outputs=bn)
debug2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Then I run prediction on some random data, just to have something to predict:
pred = debug.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
print(pred[0].shape)
pred = debug2.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
print(pred[0].shape)
The result is:
(2, 25)
2/2 [==============================] - 2s 1s/step
(25, 25, 600)
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1291 try:
-> 1292 return fn(*args)
1293 except errors.OpError as e:
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1276 return self._call_tf_sessionrun(
-> 1277 options, feed_dict, fetch_list, target_list, run_metadata)
1278
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1366 self._session, options, feed_dict, fetch_list, target_list,
-> 1367 run_metadata)
1368
InternalError: cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
InternalError Traceback (most recent call last)
<ipython-input-11-748dc132eac2> in <module>()
4 pred = debug.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
5 print(pred[0].shape)
----> 6 pred = debug2.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
7 print(pred[0].shape)
~/.local/lib/python3.5/site-packages/keras/engine/training.py in predict(self, x, batch_size, verbose, steps)
1833 f = self.predict_function
1834 return self._predict_loop(f, ins, batch_size=batch_size,
-> 1835 verbose=verbose, steps=steps)
1836
1837 def train_on_batch(self, x, y,
~/.local/lib/python3.5/site-packages/keras/engine/training.py in _predict_loop(self, f, ins, batch_size, verbose, steps)
1329 ins_batch[i] = ins_batch[i].toarray()
1330
-> 1331 batch_outs = f(ins_batch)
1332 if not isinstance(batch_outs, list):
1333 batch_outs = [batch_outs]
~/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2480 session = get_session()
2481 updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2482 **self.session_kwargs)
2483 return updated[:len(self.outputs)]
2484
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
885 try:
886 result = self._run(None, fetches, feed_dict, options_ptr,
--> 887 run_metadata_ptr)
888 if run_metadata:
889 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1108 if final_fetches or final_targets or (handle and feed_dict_tensor):
1109 results = self._do_run(handle, final_targets, final_fetches,
-> 1110 feed_dict_tensor, options, run_metadata)
1111 else:
1112 results = []
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1284 if handle is None:
1285 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1286 run_metadata)
1287 else:
1288 return self._do_call(_prun_fn, handle, feeds, fetches)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1306 self._config.experimental.client_handles_error_formatting):
1307 message = error_interpolation.interpolate(message, self._graph)
-> 1308 raise type(e)(node_def, op, message)
1309
1310 def _extend_graph(self):
InternalError: cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'batch_normalization_1/FusedBatchNorm', defined at:
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/home/remondn/.local/lib/python3.5/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelapp.py", line 497, in start
self.io_loop.start()
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 132, in start
self.asyncio_loop.run_forever()
File "/usr/lib/python3.5/asyncio/base_events.py", line 345, in run_forever
self._run_once()
File "/usr/lib/python3.5/asyncio/base_events.py", line 1312, in _run_once
handle._run()
File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
self._callback(*self._args)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
handler_func(fileobj, events)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
raw_cell, store_history, silent, shell_futures)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2901, in run_ast_nodes
if self.run_code(code, result):
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-44a967130b40>", line 87, in <module>
bn = BatchNormalization(axis=3)(grid)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/engine/topology.py", line 619, in __call__
output = self.call(inputs, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/layers/normalization.py", line 181, in call
epsilon=self.epsilon)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1831, in normalize_batch_in_training
epsilon=epsilon)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1806, in _fused_normalize_batch_in_training
data_format=tf_data_format)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_impl.py", line 909, in fused_batch_norm
name=name)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3466, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
InternalError (see above for traceback): cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
There are several things I don't understand:
- As the shape (25, 25, 600) shows, the output of the layer preceding BatchNormalization (i.e. its input) is in channels_last format. Yet the error reports input shape ([1,600,25,25]), which is channels_first format. How did the format suddenly change?
- I specified axis=3 in the declaration of the BatchNormalization layer, but the error shows FusedBatchNorm [...] data_format="NCHW", i.e. channels_first format. Whichever axis I pick (I tried 1, 2, 0 and -1), I always get this data_format in the error; changing the axis of BatchNormalization changes nothing (a small diagnostic sketch for this follows the list).
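As a side note, the node name in the error (batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer) hints that the NCHW layout is inserted by TensorFlow's Grappler layout optimizer on the GPU, not by the axis argument. A minimal diagnostic sketch (assuming the TF 1.x session API and standalone Keras, untested on this exact setup) is to disable that optimizer before building the model and see whether the reported layout changes:
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2
from keras import backend as K

# Turn off Grappler's layout optimizer so FusedBatchNorm is not rewritten
# from NHWC (channels_last) to NCHW (channels_first) on the GPU.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    layout_optimizer=rewriter_config_pb2.RewriterConfig.OFF)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
session_config = tf.ConfigProto(graph_options=graph_options)
K.set_session(tf.Session(config=session_config))  # must run before the model is built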
Does anyone know how to fix this?
It turns out the library versions I was using were messed up.
I don't know why, since everything else worked fine (in fact, removing the BatchNormalization layer made the network work correctly...).
Anyway, I updated my packages to use CUDA 9.0 together with cuDNN 7.0.5 and tensorflow-gpu 1.10.0.
The links I used to find matching versions of all of these:
- Tensorflow-gpu versions
- List of cuDNN versions depending on CUDA version (requires an NVIDIA developer account)
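To double-check that the installed pieces actually match, a quick sanity check from Python (a small sketch, assuming a TF 1.x GPU build) is:
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                # expect 1.10.0 here
print(tf.test.is_built_with_cuda())  # True for a CUDA-enabled build
# The device list should contain at least one /device:GPU:n entry
print([d.name for d in device_lib.list_local_devices()])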
I'm joining this thread because I ran into a similar error. It turned out to be related to my new hardware, which was too new for the libraries.
So, for an RTX 2080 Ti, I was able to get rid of the error with the following configuration:
CUDA 10.0 (compatible with its architecture)
cuDNN 7.4.1.5
TensorFlow 1.13 (a release candidate at the time; I used "pip3 install tf-nightly-gpu" to get a build with CUDA 10.0 support)
I also added the following to my code (see https://github.com/tensorflow/tensorflow/issues/24496):
from keras import backend as K
config = K.tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU memory as needed instead of reserving it all up front
K.set_session(K.tf.Session(config=config))  # the config only takes effect once a session is created with it
Hope this helps someone else.
I had the same problem, and it turned out to be caused by running out of memory. My model was too big. When I reduced the batch size, the problem went away.
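For illustration only (the model and data names below are hypothetical, not from the question above): lowering the batch size just means passing a smaller batch_size to the training call, e.g.
# Hypothetical fit call: if the cuDNN launch failure is really an
# out-of-memory symptom, smaller batches may make it disappear.
model.fit(X_train, y_train, epochs=10, batch_size=16)  # e.g. reduced from 64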