Dataset gets re-copied to the GPU (causing out of memory) when calling evaluate twice

Here is my code:

import gc
import tensorflow as tf

# I train a model, save it, and then clear everything with
del model
tf.keras.backend.clear_session()
gc.collect()
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
checkpoint_model = open_saved_model()    # returns a tf.keras.Model()
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
# train_ds is a tuple of numpy arrays: (inputs, labels)
eval_result = checkpoint_model.evaluate(train_ds[0], train_ds[1], batch_size=30)
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
eval_result = checkpoint_model.evaluate(train_ds[0], train_ds[1], batch_size=30)

The memory output is:

memory usage 0.0 GB
memory usage 0.013005312 GB
memory usage 5.893292544 GB

On the last line I get tensorflow.python.framework.errors_impl.InternalError (full message at the end).

My training dataset should be train_ds[0].size * train_ds[0].itemsize / 10**9 = 4.395368448 GB.

My available GPU memory (from the nvidia-smi command) is 10481MiB / 11016MiB. If I add the already-used memory plus the numpy array, I get 10.27146624 GB, which is right at the edge of the 10.48 GB that TensorFlow decides to allocate. Moreover, although it reserves 10GB, there is a message (see the full error message at the end) saying it has 8GB of memory (strange, but it would explain why I run out of memory).

Regardless of where this limit comes from, allocating the dataset again seems very wrong. It should re-use the dataset already used by evaluate, or at least replace it with the new one.

I tried using train_dataset = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(32) and the MWE works (memory usage goes up to 7.35GB), but if I change the second evaluate to predict (which is actually my real goal), I get the same error again.
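For reference, that attempt looks roughly like this (just a sketch; train_ds and open_saved_model come from my code above):

train_dataset = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(32)
eval_result = checkpoint_model.evaluate(train_dataset)   # works
eval_result = checkpoint_model.evaluate(train_dataset)   # second evaluate also works now
prediction = checkpoint_model.predict(train_ds[0])       # swapping this in for the second evaluate fails with the same error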


I read about using os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async", but with that I just get Process finished with exit code 139 (interrupted by signal 11: SIGSEGV), with no error message at all. Comparing it with the other logs, it stops before the message Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8965 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5, which I take to mean it cannot even "create the device".
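For completeness, this is how I set the variable (my assumption is that it must be in the environment before TensorFlow creates the GPU device, hence before the import):

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"  # assumption: must be set before TF initializes the GPU
import tensorflow as tf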

Motivation

This is the MWE I managed to reproduce, but the truth is that I want to evaluate and predict on many datasets, each of which should be around 5GB in size. My current workaround would be:

  1. Clear everything on the GPU
  2. Load the model
  3. Evaluate
  4. Clear everything on the GPU again
  5. Load the model again
  6. Predict

Then repeat steps 1 to 6 for each of my several datasets (very inefficient, right?).
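In code, that workaround would look roughly like this (only a sketch; open_saved_model and list_of_datasets are placeholders for my own setup):

for ds in list_of_datasets:                       # each ds is an (inputs, labels) pair of ~5GB numpy arrays
    tf.keras.backend.clear_session()              # 1. clear everything on the GPU
    gc.collect()
    checkpoint_model = open_saved_model()         # 2. load the model
    eval_result = checkpoint_model.evaluate(ds[0], ds[1], batch_size=30)   # 3. evaluate
    del checkpoint_model
    tf.keras.backend.clear_session()              # 4. clear everything on the GPU again
    gc.collect()
    checkpoint_model = open_saved_model()         # 5. load the model again
    prediction = checkpoint_model.predict(ds[0])  # 6. predict
    del checkpoint_model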

Full error message

2022-04-06 13:24:49.708029: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713988: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-06 13:24:49.714414: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.714715: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.715002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.045036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8965 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
memory usage 0.0 GB
2022-04-06 13:25:00.250155: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-04-06 13:25:00.250170: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2022-04-06 13:25:00.250192: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2022-04-06 13:25:00.250349: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory
2022-04-06 13:25:00.356485: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2022-04-06 13:25:00.356639: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2022-04-06 13:25:00.372969: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
2022-04-06 13:25:03.200488: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/utils/generic_utils.py:494: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
  warnings.warn('Custom mask layers require a config and must override '
2022-04-06 13:25:05.075473: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-04-06 13:25:07.796065: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8303
2022-04-06 13:25:08.177722: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.177947: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.177972: W tensorflow/stream_executor/gpu/asm_compiler.cc:77] Couldn't get ptxas version string: Internal: Couldn't invoke ptxas --version
2022-04-06 13:25:08.178231: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.178262: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
  1/187 [..............................] - ETA: 11:00 - loss: 0.8855 - accuracy: 0.3187 - average_accuracy: 0.2666 - precision: 0.3264 - recall: 0.00222022-04-06 13:25:08.716360: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-04-06 13:25:08.716379: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
  2/187 [..............................] - ETA: 1:05 - loss: 0.8388 - accuracy: 0.3063 - average_accuracy: 0.2759 - precision: 0.3169 - recall: 0.0049 2022-04-06 13:25:09.011233: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-04-06 13:25:09.011432: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2022-04-06 13:25:09.040157: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673]  GpuTracer has collected 705 callback api events and 707 activity events. 
2022-04-06 13:25:09.049283: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2022-04-06 13:25:09.061327: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09

2022-04-06 13:25:09.071522: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.trace.json.gz
2022-04-06 13:25:09.096291: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09

2022-04-06 13:25:09.101018: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.memory_profile.json.gz
2022-04-06 13:25:09.101899: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09
Dumped tool data for xplane.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.xplane.pb
Dumped tool data for overview_page.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.overview_page.pb
Dumped tool data for input_pipeline.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.kernel_stats.pb

187/187 [==============================] - 10s 37ms/step - loss: 0.8277 - accuracy: 0.5412 - average_accuracy: 0.3043 - precision: 0.5026 - recall: 0.0087 - val_loss: 0.8309 - val_accuracy: 0.6880 - val_average_accuracy: 0.2931 - val_precision: 0.6810 - val_recall: 0.0047
memory usage 6.042584576 GB
memory usage 0.006478336 GB
2022-04-06 13:25:16.022531: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
memory usage 0.012938752 GB
2022-04-06 13:25:18.885690: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
187/187 [==============================] - 4s 16ms/step - loss: 0.8138 - accuracy: 0.6999 - average_accuracy: 0.2968 - precision: 0.6710 - recall: 0.0058
memory usage 5.90003712 GB
2022-04-06 13:25:24.458057: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
2022-04-06 13:25:35.851249: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.09GiB (rounded to 4396941312)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-04-06 13:25:35.851336: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2022-04-06 13:25:35.851375: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256):  Total Chunks: 263, Chunks in use: 263. 65.8KiB allocated for chunks. 65.8KiB in use in bin. 15.2KiB client-requested in use in bin.
2022-04-06 13:25:35.851405: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512):  Total Chunks: 71, Chunks in use: 70. 42.2KiB allocated for chunks. 41.8KiB in use in bin. 36.0KiB client-requested in use in bin.
2022-04-06 13:25:35.851432: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024):     Total Chunks: 10, Chunks in use: 9. 15.0KiB allocated for chunks. 14.0KiB in use in bin. 12.6KiB client-requested in use in bin.
2022-04-06 13:25:35.851456: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851483: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096):     Total Chunks: 6, Chunks in use: 6. 31.5KiB allocated for chunks. 31.5KiB in use in bin. 30.4KiB client-requested in use in bin.
2022-04-06 13:25:35.851511: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192):     Total Chunks: 12, Chunks in use: 12. 123.0KiB allocated for chunks. 123.0KiB in use in bin. 121.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851534: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384):    Total Chunks: 1, Chunks in use: 0. 30.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851560: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768):    Total Chunks: 13, Chunks in use: 11. 579.0KiB allocated for chunks. 475.5KiB in use in bin. 445.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851586: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536):    Total Chunks: 1, Chunks in use: 1. 73.8KiB allocated for chunks. 73.8KiB in use in bin. 40.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851610: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072):   Total Chunks: 18, Chunks in use: 18. 2.85MiB allocated for chunks. 2.85MiB in use in bin. 2.72MiB client-requested in use in bin.
2022-04-06 13:25:35.851634: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144):   Total Chunks: 2, Chunks in use: 1. 769.5KiB allocated for chunks. 283.5KiB in use in bin. 162.0KiB client-requested in use in bin.
2022-04-06 13:25:35.851658: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288):   Total Chunks: 11, Chunks in use: 10. 6.96MiB allocated for chunks. 6.33MiB in use in bin. 6.33MiB client-requested in use in bin.
2022-04-06 13:25:35.851682: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576):  Total Chunks: 2, Chunks in use: 2. 2.25MiB allocated for chunks. 2.25MiB in use in bin. 1.27MiB client-requested in use in bin.
2022-04-06 13:25:35.851704: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851725: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304):  Total Chunks: 1, Chunks in use: 0. 4.43MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851769: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608):  Total Chunks: 1, Chunks in use: 0. 12.44MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851799: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851821: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851841: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851865: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728):    Total Chunks: 1, Chunks in use: 0. 128.08MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851889: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456):    Total Chunks: 3, Chunks in use: 2. 8.60GiB allocated for chunks. 5.48GiB in use in bin. 5.46GiB client-requested in use in bin.
2022-04-06 13:25:35.851911: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 4.09GiB was 256.00MiB, Chunk State: 
2022-04-06 13:25:35.851941: I tensorflow/core/common_runtime/bfc_allocator.cc:1033]   Size: 3.12GiB | Requested Size: 1.97MiB | in_use: 0 | bin_num: 20, prev:   Size: 512B | Requested Size: 384B | in_use: 1 | bin_num: -1
2022-04-06 13:25:35.851960: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 9401270272
2022-04-06 13:25:35.851981: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000000 of size 256 next 4
2022-04-06 13:25:35.851999: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000100 of size 256 next 6
2022-04-06 13:25:35.852016: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000200 of size 256 next 3
2022-04-06 13:25:35.852032: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000300 of size 256 next 5
2022-04-06 13:25:35.852048: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000400 of size 256 next 9
2022-04-06 13:25:35.852064: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000500 of size 256 next 7
2022-04-06 13:25:35.852080: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000600 of size 256 next 8
2022-04-06 13:25:35.852097: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000700 of size 256 next 10
2022-04-06 13:25:35.852113: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000800 of size 256 next 13
2022-04-06 13:25:35.852128: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000900 of size 256 next 14
2022-04-06 13:25:35.852144: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000a00 of size 256 next 15
2022-04-06 13:25:35.852159: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000b00 of size 256 next 83
2022-04-06 13:25:35.852174: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000c00 of size 256 next 17
2022-04-06 13:25:35.852189: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000d00 of size 256 next 18
2022-04-06 13:25:35.852204: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000e00 of size 256 next 21
.... Many messages like this; Stack Overflow limits my max characters, so I cropped it.
2022-04-06 13:25:35.858545: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fcaaace5700 of size 512 next 320
2022-04-06 13:25:35.858561: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] Free  at 7fcaaace5900 of size 3347949312 next 18446744073709551615
2022-04-06 13:25:35.858576: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2022-04-06 13:25:35.858600: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 263 Chunks of size 256 totalling 65.8KiB
2022-04-06 13:25:35.858620: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 43 Chunks of size 512 totalling 21.5KiB
2022-04-06 13:25:35.858639: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 27 Chunks of size 768 totalling 20.2KiB
2022-04-06 13:25:35.858658: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1024 totalling 1.0KiB
2022-04-06 13:25:35.858675: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 2 Chunks of size 1280 totalling 2.5KiB
2022-04-06 13:25:35.858694: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 6 Chunks of size 1792 totalling 10.5KiB
2022-04-06 13:25:35.858712: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 6 Chunks of size 5376 totalling 31.5KiB
2022-04-06 13:25:35.858732: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 12 Chunks of size 10496 totalling 123.0KiB
2022-04-06 13:25:35.858751: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 9 Chunks of size 41472 totalling 364.5KiB
2022-04-06 13:25:35.858770: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 51456 totalling 50.2KiB
2022-04-06 13:25:35.858789: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 62208 totalling 60.8KiB
2022-04-06 13:25:35.858807: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 75520 totalling 73.8KiB
2022-04-06 13:25:35.858826: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 5 Chunks of size 147456 totalling 720.0KiB
2022-04-06 13:25:35.858845: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 11 Chunks of size 165888 totalling 1.74MiB
2022-04-06 13:25:35.858863: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 176640 totalling 172.5KiB
2022-04-06 13:25:35.858882: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 248832 totalling 243.0KiB
2022-04-06 13:25:35.858901: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 290304 totalling 283.5KiB
2022-04-06 13:25:35.858919: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 10 Chunks of size 663552 totalling 6.33MiB
2022-04-06 13:25:35.858937: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 2 Chunks of size 1179648 totalling 2.25MiB
2022-04-06 13:25:35.858955: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1489978112 totalling 1.39GiB
2022-04-06 13:25:35.858973: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 4396941312 totalling 4.09GiB
2022-04-06 13:25:35.858991: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 5.49GiB
2022-04-06 13:25:35.859009: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 9401270272 memory_limit_: 9401270272 available bytes: 0 curr_region_allocation_bytes_: 18802540544
2022-04-06 13:25:35.859036: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      9401270272
InUse:                      5900037120
MaxInUse:                   6431716864
NumAllocs:                      165083
MaxAllocSize:               4396941312
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-04-06 13:25:35.859127: W tensorflow/core/common_runtime/bfc_allocator.cc:468] *****************************************************************___________________________________
Traceback (most recent call last):
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 524, in <module>
    run_wrapper(model_name=args.model[0], balance=args.balance[0], tensorflow=args.tensorflow,
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 504, in run_wrapper
    df, dataset_handler, eval_df = run_model(model_name=model_name, balance=balance, tensorflow=tensorflow,
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 440, in run_model
    prediction_result = checkpoint_model.predict(train_ds[0])
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/training.py", line 1720, in predict
    data_handler = data_adapter.get_data_handler(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 230, in __init__
    x, y, sample_weights = _process_tensorlike((x, y, sample_weights))
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1031, in _process_tensorlike
    inputs = tf.nest.map_structure(_convert_numpy_and_scipy, inputs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1026, in _convert_numpy_and_scipy
    return tf.convert_to_tensor(x, dtype=dtype)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

This problem might be related to this.

The solution for me (still super strange) was:

train_x = tf.convert_to_tensor(train_ds[0])

and to use train_x instead of train_ds[0].

Now the weird part. If I also do train_y = tf.convert_to_tensor(train_ds[1]), it does not work. I only need to convert train_ds[0], and only that one. By "work" I mean that I can run evaluate and then predict without clearing everything in between.
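Put together, the version that works for me looks roughly like this (a sketch; only the input array is converted, as explained above):

checkpoint_model = open_saved_model()
train_x = tf.convert_to_tensor(train_ds[0])     # convert only the inputs, not the labels
eval_result = checkpoint_model.evaluate(train_x, train_ds[1], batch_size=30)
prediction = checkpoint_model.predict(train_x)  # evaluate and predict now run without clearing the GPU in between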

train_ds should be a tf.data.Dataset, or at least a tf.Tensor. If it is a numpy array, a list, or a pandas data structure, you will not get TF's optimizations for performance, memory allocation, and so on.
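As an illustration of that recommendation (a sketch only, not something I have verified against the exact memory numbers above): build the datasets once and feed them to both calls, using an inputs-only dataset for predict:

eval_ds = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(32)  # (inputs, labels) for evaluate
pred_ds = tf.data.Dataset.from_tensor_slices(train_ds[0]).batch(32)                 # inputs only, for predict
eval_result = checkpoint_model.evaluate(eval_ds)
prediction = checkpoint_model.predict(pred_ds)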