ResourceExhaustedError: OOM when allocating tensor
I implemented AlexNet myself, minus one fully connected layer, to classify 102 classes of flowers. My training set contains 11,000 images, and the validation and test sets each contain 3,000 images. I wrote all three datasets to disk in HDF5 format. I reload them and try to feed the images through the network in batches of 8 for 75 epochs, but a memory error occurs.
I have already tried reducing the batch size to 8 and shrinking the images to 400x400 (originally 500x500), but it did not help.
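If I read the log correctly, the repeated 822083584-byte allocations and the final OOM on a shape[50176,4096] float tensor both point at a single weight matrix, presumably the kernel of the first fully connected layer (and its regularizer gradient), whose size does not depend on the batch size. A quick back-of-the-envelope check, plain Python arithmetic only:

    # Size of one float32 tensor of shape [50176, 4096], the shape reported
    # in the OOM message below (4 bytes per float32 element).
    rows, cols, bytes_per_float32 = 50176, 4096, 4
    size_bytes = rows * cols * bytes_per_float32
    print(size_bytes)           # 822083584 -- the exact number in the log
    print(size_bytes / 2**20)   # 784.0 MiB -- the allocation the OOM reports

That single tensor is about 784 MiB on a 4 GiB card, which may be why lowering the batch size alone made no difference.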
tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
instructions that this TensorFlow binary was not compiled to use: AVX2
2019-08-23 00:19:47.336560: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0
with properties: name: GeForce GTX 1050 Ti major: 6 minor: 1
memoryClockRate(GHz): 1.62 pciBusID: 0000:01:00.0 totalMemory: 4.00GiB
freeMemory: 3.30GiB 2019-08-23 00:19:47.342432: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible
gpu devices: 0 2019-08-23 00:19:47.900540: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device
interconnect StreamExecutor with strength 1 edge matrix: 2019-08-23
00:19:47.904687: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-23 00:19:47.907033: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-23 00:19:47.909380: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with
3007 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti,
pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-08-23
00:19:48.550001: W tensorflow/core/framework/allocator.cc:124]
Allocation of 822083584 exceeds 10% of system memory. 2019-08-23
00:19:49.089904: W tensorflow/core/framework/allocator.cc:124]
Allocation of 822083584 exceeds 10% of system memory. 2019-08-23
00:19:49.629533: W tensorflow/core/framework/allocator.cc:124]
Allocation of 822083584 exceeds 10% of system memory. 2019-08-23
00:19:50.067994: W tensorflow/core/framework/allocator.cc:124]
Allocation of 822083584 exceeds 10% of system memory. 2019-08-23
00:19:50.523258: W tensorflow/core/framework/allocator.cc:124]
Allocation of 822083584 exceeds 10% of system memory. Epoch 1/75
2019-08-23 00:20:14.632764: I
tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA
library cublas64_100.dll locally 2019-08-23 00:20:16.325917: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 3.14GiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.410374: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 836.38MiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.650565: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.716695: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 1.22GiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.733003: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 637.52MiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.782250: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 844.88MiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:16.792756: W
tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller
indicates that this is not a failure, but may mean that there could be
performance gains if more memory were available. 2019-08-23
00:20:25.135977: W
tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator
(GPU_0_bfc) ran out of memory trying to allocate 784.00MiB. Current
allocation summary follows. 2019-08-23 00:20:25.143913: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):
Total Chunks: 104, Chunks in use: 99. 26.0KiB allocated for chunks.
24.8KiB in use in bin. 452B client-requested in use in bin. 2019-08-23 00:20:25.150353: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):
Total Chunks: 16, Chunks in use: 14. 8.0KiB allocated for chunks.
7.0KiB in use in bin. 5.3KiB client-requested in use in bin. 2019-08-23 00:20:25.160812: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024):
Total Chunks: 49, Chunks in use: 49. 61.3KiB allocated for chunks.
61.3KiB in use in bin. 60.1KiB client-requested in use in bin. 2019-08-23 00:20:25.169944: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048):
Total Chunks: 4, Chunks in use: 4. 13.0KiB allocated for chunks.
13.0KiB in use in bin. 12.8KiB client-requested in use in bin. 2019-08-23 00:20:25.182025: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096):
Total Chunks: 1, Chunks in use: 0. 6.3KiB allocated for chunks. 0B in
use in bin. 0B client-requested in use in bin. 2019-08-23
00:20:25.192454: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192):
Total Chunks: 1, Chunks in use: 0. 15.0KiB allocated for chunks. 0B in
use in bin. 0B client-requested in use in bin. 2019-08-23
00:20:25.200847: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384):
Total Chunks: 9, Chunks in use: 9. 144.8KiB allocated for chunks.
144.8KiB in use in bin. 144.0KiB client-requested in use in bin. 2019-08-23 00:20:25.209817: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768):
Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use
in bin. 0B client-requested in use in bin. 2019-08-23 00:20:25.219192:
I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536):
Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use
in bin. 0B client-requested in use in bin. 2019-08-23 00:20:25.228194:
I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072):
Total Chunks: 9, Chunks in use: 9. 1.17MiB allocated for chunks.
1.17MiB in use in bin. 1.16MiB client-requested in use in bin. 2019-08-23 00:20:25.236088: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144):
Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use
in bin. 0B client-requested in use in bin. 2019-08-23 00:20:25.245435:
I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288):
Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use
in bin. 0B client-requested in use in bin. 2019-08-23 00:20:25.254114:
I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576):
Total Chunks: 8, Chunks in use: 7. 12.25MiB allocated for chunks.
11.22MiB in use in bin. 10.91MiB client-requested in use in bin. 2019-08-23 00:20:25.264209: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152):
Total Chunks: 14, Chunks in use: 14. 42.09MiB allocated for chunks.
42.09MiB in use in bin. 42.09MiB client-requested in use in bin. 2019-08-23 00:20:25.273799: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304):
Total Chunks: 13, Chunks in use: 13. 80.41MiB allocated for chunks.
80.41MiB in use in bin. 77.91MiB client-requested in use in bin. 2019-08-23 00:20:25.285089: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608):
Total Chunks: 13, Chunks in use: 13. 141.14MiB allocated for chunks.
141.14MiB in use in bin. 136.45MiB client-requested in use in bin. 2019-08-23 00:20:25.298520: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216):
Total Chunks: 4, Chunks in use: 4. 112.98MiB allocated for chunks.
112.98MiB in use in bin. 112.98MiB client-requested in use in bin. 2019-08-23 00:20:25.306979: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432):
Total Chunks: 4, Chunks in use: 4. 183.11MiB allocated for chunks.
183.11MiB in use in bin. 183.11MiB client-requested in use in bin. 2019-08-23 00:20:25.315121: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864):
Total Chunks: 1, Chunks in use: 0. 82.18MiB allocated for chunks. 0B
in use in bin. 0B client-requested in use in bin. 2019-08-23
00:20:25.322194: I
tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728):
Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use
in bin. 0B client-requested in use in bin. 2019-08-23 00:20:25.331550:
I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin
(268435456): Total Chunks: 3, Chunks in use: 3. 2.30GiB allocated
for chunks. 2.30GiB in use in bin. 2.30GiB client-requested in use in
bin. 2019-08-23 00:20:25.342419: I
tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 784.00MiB
was 256.00MiB, Chunk State:
tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of
in-use chunks: 2.87GiB 2019-08-23 00:20:50.049508: I
tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats: Limit:
3153697177 InUse: 3086482944 MaxInUse:
3153574400 NumAllocs: 388 MaxAllocSize:
822083584
2019-08-23 00:20:50.061236: W
tensorflow/core/common_runtime/bfc_allocator.cc:271]
**************************************************************************************************__ 2019-08-23 00:20:50.066546: W
tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at
cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating
tensor with shape[50176,4096] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last): File "train.py", line 80, in
max_queue_size=8 * 2, verbose=1) File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py",
line 1426, in fit_generator
initial_epoch=initial_epoch) File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training_generator.py",
line 191, in model_iteration
batch_outs = batch_function(*batch_data) File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py",
line 1191, in train_on_batch
outputs = self._fit_function(ins) # pylint: disable=not-callable File
"C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\backend.py",
line 3076, in __call__
run_metadata=self.run_metadata) File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\client\session.py",
line 1439, in __call__
run_metadata_ptr) File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\framework\errors_impl.py",
line 528, in __exit__
c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM
when allocating tensor with shape[50176,4096] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training/RMSprop/gradients/loss/kernel/Regularizer_5/Square_grad/Mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
[[{{node ConstantFoldingCtrl/loss/activation_6_loss/broadcast_weights/assert_broadcastable/AssertGuard/Switch_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
This happens because no GPU memory is free to be allocated for training. If you were not batching, it could be caused by loading the whole dataset into memory, but you are already using fit_generator,
so we can rule that out: it supplies the data for training in batches while generating the batches in parallel.
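For context, a fit_generator setup along the lines of the sketch below (the generator, file name, and dataset keys are illustrative assumptions, not your actual code) only keeps a small queue of batches in host memory, so the HDF5 datasets never have to fit in RAM:

    import h5py
    import numpy as np

    # Illustrative HDF5 batch generator: yields one batch at a time, so only
    # `batch_size` images are resident in host memory per training step.
    def hdf5_batch_generator(path, batch_size=8):
        with h5py.File(path, "r") as f:
            images, labels = f["images"], f["labels"]
            n = images.shape[0]
            while True:
                for start in range(0, n, batch_size):
                    stop = min(start + batch_size, n)
                    yield np.asarray(images[start:stop]), np.asarray(labels[start:stop])

    # Roughly how it would be wired up (numbers taken from the question):
    # model.fit_generator(hdf5_batch_generator("train.hdf5"),
    #                     steps_per_epoch=11000 // 8, epochs=75,
    #                     max_queue_size=8 * 2, workers=1, verbose=1)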
The solution is to check which process is currently using your GPU. If you have an NVIDIA GPU, you can check which processes are consuming it with nvidia-smi;
otherwise you can also try ps -fA | grep python.
This shows you which running process is holding the GPU. Just take the process ID from the PID column and kill that process with kill -9 PID.
Then re-run the training; this time your GPU will be free. I ran into the same problem, and clearing the GPU helped me.
- Note: all commands are run in a terminal.
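If you prefer to run the check from Python instead of typing the commands, here is a rough sketch (the helper name is mine; it only wraps nvidia-smi's documented CSV query mode) that lists the processes currently holding GPU memory:

    import subprocess

    # List processes currently holding GPU memory via nvidia-smi's CSV query
    # mode; take the PID from the output and kill that process.
    def gpu_processes():
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader"],
            universal_newlines=True)
        return [line.strip() for line in out.splitlines() if line.strip()]

    if __name__ == "__main__":
        for proc in gpu_processes():
            print(proc)  # e.g. "1234, python.exe, 3100 MiB"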