Keras 多 GPU 示例给出 ResourceExhaustedError
Keras Multi GPU example gives ResourceExhaustedError
所以我尝试在 Keras 中使用多个 GPU。当我 运行 training_utils.py 使用示例程序时(在 training_utils.py 代码中作为注释给出),我最终得到 ResourceExhaustedError
。 nvidia-smi
告诉我四个 GPU 中只有一个在工作。使用一个 GPU 可以很好地用于其他程序。
- 张量流 1.3.0
- 凯拉斯 2.0.8
- Ubuntu 16.04
- CUDA/cuDNN 8.0/6.0
问题:有人知道这是怎么回事吗?
控制台输出:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
File "test.py", line 27, in
parallel_model.fit(x, y, epochs=20, batch_size=256)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
validation_steps=validation_steps)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
outs = f(ins_batch)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in call
**self.session_kwargs)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
[[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
[[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d',
defined at: File "test.py", line 19, in
parallel_model = multi_gpu_model(model, gpus=2) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py",
line 143, in multi_gpu_model
outputs = model(inputs) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py",
line 603, in call
output = self.call(inputs, **kwargs) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py",
line 2061, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks) File
"/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py",
line 2212, in run_internal_graph
output_tensors = _to_list(layer.call(computed_tensor, **kwargs)) File
"/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py",
line 1221, in call
dilation_rate=self.dilation_rate) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py",
line 3279, in separable_conv2d
data_format=tf_data_format) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py",
line 497, in separable_conv2d
name=name) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py",
line 397, in conv2d
data_format=data_format, name=name) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py",
line 767, in apply_op
op_def=op_def) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py",
line 2630, in create_op
original_op=self._default_original_op, op_def=op_def) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py",
line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating
tensor with shape[128,55,55,256] [[Node:
replica_1/xception/block3_sepconv2/separable_conv2d =
Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1,
1, 1], use_cudnn_on_gpu=true,
_device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise,
block3_sepconv2/pointwise_kernel/read/_2103)]] [[Node:
training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511
= _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0",
send_device="/job:localhost/replica:0/task:0/gpu:0",
send_device_incarnation=1,
tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter",
tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/cpu:0"]]
编辑(添加示例代码):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 100
with tf.device('/cpu:0'):
model = Xception(weights=None,
input_shape=(height, width, 3),
classes=num_classes)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
parallel_model.fit(x, y, epochs=20, batch_size=128)
当在 GPU 上遇到 OOM/ResourceExhaustedError 时,我认为更改(减少)batch size
是最开始尝试的正确选择。
For different GPU you may need different batch size based on the GPU
memory you have.
最近我遇到了类似的问题,做了很多调整来做不同类型的实验。
这里是link到(也包括一些技巧)。
但是,在减小批次大小的同时,您可能会发现训练速度变慢了。
所以我尝试在 Keras 中使用多个 GPU。当我 运行 training_utils.py 使用示例程序时(在 training_utils.py 代码中作为注释给出),我最终得到 ResourceExhaustedError
。 nvidia-smi
告诉我四个 GPU 中只有一个在工作。使用一个 GPU 可以很好地用于其他程序。
- 张量流 1.3.0
- 凯拉斯 2.0.8
- Ubuntu 16.04
- CUDA/cuDNN 8.0/6.0
问题:有人知道这是怎么回事吗?
控制台输出:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x 2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256] Traceback (most recent call last): File "test.py", line 27, in parallel_model.fit(x, y, epochs=20, batch_size=256) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit validation_steps=validation_steps) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop outs = f(ins_batch) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in call **self.session_kwargs) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run run_metadata_ptr) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run feed_dict_tensor, options, run_metadata) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run options, run_metadata) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256] [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]] [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at: File "test.py", line 19, in parallel_model = multi_gpu_model(model, gpus=2) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model outputs = model(inputs) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in call output = self.call(inputs, **kwargs) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call output_tensors, _, _ = self.run_internal_graph(inputs, masks) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph output_tensors = _to_list(layer.call(computed_tensor, **kwargs)) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call dilation_rate=self.dilation_rate) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d data_format=tf_data_format) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d name=name) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d data_format=data_format, name=name) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op op_def=op_def) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op original_op=self._default_original_op, op_def=op_def) File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256] [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]] [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
编辑(添加示例代码):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 100
with tf.device('/cpu:0'):
model = Xception(weights=None,
input_shape=(height, width, 3),
classes=num_classes)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
parallel_model.fit(x, y, epochs=20, batch_size=128)
当在 GPU 上遇到 OOM/ResourceExhaustedError 时,我认为更改(减少)batch size
是最开始尝试的正确选择。
For different GPU you may need different batch size based on the GPU memory you have.
最近我遇到了类似的问题,做了很多调整来做不同类型的实验。
这里是link到
但是,在减小批次大小的同时,您可能会发现训练速度变慢了。