How to print the maximum memory used during Keras's model.fit()

I have written a neural network model with Keras/TensorFlow and I can train and run it. Now I would like to know how much memory training the model requires. How can I print that information during the training phase? I tried the Keras model profiler below, but it does not account for the peak memory needed during training. For example, training my model runs out of memory on a 6 GB GPU card, yet the profile reports a memory requirement of less than 1 GB. So, how can I measure the peak run-time memory requirement when I call model.fit() in Keras?

https://github.com/Mr-TalhaIlyas/Tensorflow-Keras-Model-Profiler

I suggest using a Keras Callback and printing the GPU usage after each epoch. You can query the GPU with tf.config.experimental.get_memory_info('GPU:0'). Here is a working example:

import tensorflow as tf

class MemoryPrintingCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # 'current' is the memory TensorFlow is using right now on the device,
        # 'peak' is the highest usage observed so far.
        gpu_dict = tf.config.experimental.get_memory_info('GPU:0')
        tf.print('\n GPU memory details [current: {} gb, peak: {} gb]'.format(
            float(gpu_dict['current']) / (1024 ** 3),
            float(gpu_dict['peak']) / (1024 ** 3)))

# Small demo model and random data to show the callback in action.
inputs = tf.keras.layers.Input((1000,))
x = tf.keras.layers.Dense(1000, 'relu')(inputs)
x = tf.keras.layers.Dense(1000, 'relu')(x)
x = tf.keras.layers.Dense(1000, 'relu')(x)
x = tf.keras.layers.Dense(1000, 'relu')(x)
outputs = tf.keras.layers.Dense(1, 'sigmoid')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())

x = tf.random.normal((500, 1000))
y = tf.random.uniform((500, 1), maxval=2, dtype=tf.int32)
model.fit(x, y, batch_size=50, epochs=20, callbacks=[MemoryPrintingCallback()])

 GPU memory details [current: 0.321030855178833 gb, peak: 0.32660841941833496 gb]
Epoch 1/20
10/10 [==============================] - 1s 8ms/step - loss: 0.9309

 GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 2/20
10/10 [==============================] - 0s 7ms/step - loss: 0.5702

 GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 3/20
10/10 [==============================] - 0s 8ms/step - loss: 0.1311

 GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 4/20
10/10 [==============================] - 0s 7ms/step - loss: 0.0865

 GPU memory details [current: 0.3508758544921875 gb, peak: 0.3661658763885498 gb]
Epoch 5/20
10/10 [==============================] - 0s 7ms/step - loss: 0.0379
...
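
If you are only after a single number for the whole training run rather than per-epoch reports, a variation is to reset the device statistics when training starts and read the peak once at the end. This is just a sketch and assumes TensorFlow 2.5 or newer, where tf.config.experimental.reset_memory_stats is available:

class PeakMemoryCallback(tf.keras.callbacks.Callback):
    def __init__(self, device='GPU:0'):
        super().__init__()
        self.device = device

    def on_train_begin(self, logs=None):
        # Reset the tracked 'peak' statistic so it only reflects this fit() call
        # (requires TF >= 2.5).
        tf.config.experimental.reset_memory_stats(self.device)

    def on_train_end(self, logs=None):
        peak = tf.config.experimental.get_memory_info(self.device)['peak']
        print('\nPeak GPU memory during fit(): {:.3f} gb'.format(peak / (1024 ** 3)))

model.fit(x, y, batch_size=50, epochs=20, callbacks=[PeakMemoryCallback()])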

You can find the name of your device as follows:

print(tf.config.list_physical_devices('GPU'))
#[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

However, be aware of the following:

For GPUs, TensorFlow will allocate all the memory by default, unless changed with tf.config.experimental.set_memory_growth. The dict specifies only the current and peak memory that TensorFlow is actually using, not the memory that TensorFlow has allocated on the GPU. source
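
So the numbers from get_memory_info will not match what nvidia-smi shows unless you turn off that default allocate-everything behaviour. A minimal sketch of enabling memory growth (it must run before any GPU operation initializes the device, typically right after importing TensorFlow):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)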