"Resource exhausted: OOM when allocating tensor" GPT 2 模型再训练期间:
"Resource exhausted: OOM when allocating tensor" during Retraining of GPT 2 Model:
I'm training a conversational AI with GPT-2, using the Friends dialogues as the dataset, but it's running out of memory. I know this problem has been addressed on Stack Overflow before, but I don't know how to optimize for an NLP task.
I've tried setting the batch size to 50 (my dataset has roughly 60k lines). I've been following this tutorial on retraining GPT-2 on a custom dataset.
My system specs are:
OS: Windows 10
RAM: 16 GB
CPU: i7 8th gen
GPU: 4 GB Nvidia GTX 1050 Ti
Here is the full error message:
Resource exhausted: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/h0/attn/c_attn/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node Mean}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 293, in <module>
main()
File "train.py", line 271, in main
feed_dict={context: sample_batch()})
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'model/h0/attn/c_attn/MatMul', defined at:
File "train.py", line 293, in <module>
main()
File "train.py", line 93, in main
output = model.model(hparams=hparams, X=context_in)
File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 164, in model
h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 126, in block
a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 102, in attn
c = conv1d(x, 'c_attn', n_state*3)
File "D:\Python and AI\Generative Chatbot\gpt-2\src\model.py", line 55, in conv1d
c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2455, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5333, in mat_mul
name=name)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "C:\Users\bhave\AppData\Local\conda\conda\envs\tf_gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[51200,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h0/attn/c_attn/MatMul (defined at D:\Python and AI\Generative Chatbot\gpt-2\src\model.py:55) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[node Mean (defined at train.py:96) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
I guess it always means the same thing. Try setting the batch size to 1 and see whether that works, then raise it to find out how much your GPU can handle. If it can't even handle a batch size of 1, the model is probably too large for your GPU. If you don't get the error right away, check that the code is okay; maybe there's a bug in it. Oh, and maybe you should check what else is using your GPU, just to make sure nothing unnecessary is taking up resources.
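The hint printed with the error also suggests asking TensorFlow to list the tensors that are alive when the allocation fails. A minimal sketch of what that could look like, assuming a TF 1.x training loop like the one in the tutorial's train.py (loss and opt_apply are placeholder names, not necessarily what the script defines):

import tensorflow as tf

# Ask TensorFlow to dump the currently allocated tensors if an allocation
# fails, as recommended by the hint in the error message.
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Inside the training loop, pass the options to sess.run:
# loss_value, _ = sess.run(
#     [loss, opt_apply],
#     feed_dict={context: sample_batch()},
#     options=run_options)

The report does not prevent the OOM, but it shows which tensors were occupying the GPU when the allocation failed, which helps confirm whether the batch size or the model itself is the problem.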
I followed the same tutorial and ran into the same OOM (out of memory) problem. Everything works when training on the CPU, although it's slow, so the Python code and the setup are fine; the VRAM on the graphics card is simply too small.
If you run into this while training on the GPU and want to test how it works on the CPU, you can disable TensorFlow's access to the GPU by changing the following line in train.py from:
config = tf.ConfigProto()
...to:
config = tf.ConfigProto(device_count = {'GPU': 0})
If you want to update the syntax to avoid the nagging warnings in the console, you can use the new form:
config = tf.compat.v1.ConfigProto(device_count = {'GPU': 0})
This stops TensorFlow from using the GPU and does all the training on the CPU instead. If your machine has more RAM than your graphics card has VRAM, this may solve the OOM problem.
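For completeness, a minimal sketch of how that config would be passed to the session, assuming a TF 1.x setup like the tutorial's train.py (the graph-building part is elided):

import tensorflow as tf

# Hiding the GPU forces every op onto the CPU, which uses system RAM
# instead of the graphics card's VRAM.
config = tf.compat.v1.ConfigProto(device_count={'GPU': 0})

with tf.compat.v1.Session(config=config) as sess:
    # ... build the GPT-2 graph and run the training loop here, as train.py does
    pass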
I have a GTX 1080 Ti with 11 GB of VRAM, which is as much as a Pascal-generation graphics card offers. But I had switched from the project's original small model (117M) to the medium model (355M), which affects how much memory the training run needs. It didn't matter that I set the batch size to 1; it was still more than my GPU could handle.
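As a rough back-of-envelope estimate (my own assumption, not something from the tutorial), the weights, gradients, and Adam moment buffers alone already account for several gigabytes with the 355M model:

# float32 weights + gradients + the two Adam moment buffers
params = 355e6
bytes_per_value = 4   # float32
copies = 4            # weights, gradients, Adam m, Adam v
print(f"~{params * bytes_per_value * copies / 1e9:.1f} GB before activations")
# ~5.7 GB before activations; the attention activations for a 1024-token
# context can push this past even an 11 GB card, never mind a 4 GB one.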