hub.KerasLayer() always consumes the same GPU memory regardless of max_seq_len

I am using BERT from TensorFlow Hub. After reading the following note in the original BERT repository, I wanted to save GPU memory by reducing the model's max_seq_len:

max_seq_length: The released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter max sequence length to save substantial memory. This is controlled by the max_seq_length flag in our example code.
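
To see why a shorter sequence should help, here is a rough back-of-the-envelope sketch of one memory term only: the self-attention score tensors, which grow with the square of the sequence length. It assumes float32 activations and that one [batch, heads, seq, seq] tensor per layer is kept for backpropagation. The function name attention_scores_mib is made up for this illustration, and the real footprint includes many other tensors.

```python
# Rough size of the self-attention score tensors in BERT-base.
NUM_LAYERS = 12      # "L-12" in the hub module name
NUM_HEADS = 12       # "A-12" in the hub module name
BYTES_PER_FLOAT = 4  # assumption: float32 activations


def attention_scores_mib(batch_size: int, seq_len: int) -> float:
    """MiB needed for one [batch, heads, seq, seq] tensor per layer (hypothetical helper)."""
    per_layer = batch_size * NUM_HEADS * seq_len * seq_len * BYTES_PER_FLOAT
    return NUM_LAYERS * per_layer / 2**20


for seq_len in (512, 64):
    print(seq_len, attention_scores_mib(batch_size=10, seq_len=seq_len))
# prints:
# 512 1440.0
# 64 22.5
```

With a batch size of 10, going from 512 to 64 tokens shrinks this single term by a factor of 64, so a real difference in memory use would be expected.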

In my tests, however, the BERT model always uses the same amount of GPU memory no matter how I change max_seq_len. Here is my test script.

import numpy as np
import tensorflow_hub as hub
import tensorflow as tf

num_sample = 1000
batch_size = 10
max_seq_len = 512
num_class = 30
vocab_num = 30000
epochs = 100
learning_rate = 1e-5

# get the pooled_output of Bert and pass it to a dense layer
def bert_model():
    input_ids = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_ids')
    input_masks = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_masks')
    input_segments = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_segments')

    bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=True)

    pooled_output, sequence_output = bert_layer([input_ids, input_masks, input_segments])

    out = tf.keras.layers.Dense(num_class, activation="sigmoid", name="dense_output")(pooled_output)

    model = tf.keras.models.Model(inputs=[input_ids, input_masks, input_segments], outputs=out)

    return model

outputs = np.random.randn(num_sample, num_class)
inputs = [np.random.randint(vocab_num, size=(num_sample, max_seq_len), dtype=np.int32),  # ids
          np.ones((num_sample, max_seq_len), dtype=np.int32),  # masks
          np.zeros((num_sample, max_seq_len), dtype=np.int32)]  # segments

model = bert_model()
print(model.summary())

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(loss='binary_crossentropy', optimizer=optimizer)  # multi-label task
model.fit(inputs, outputs, epochs=epochs, verbose=1, batch_size=batch_size)

With max_seq_len = 512, I ran the script on GPU 1 with CUDA_VISIBLE_DEVICES=1 python bert_test.py and got the following output.

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_ids (InputLayer)          [(None, 512)]        0
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 512)]        0
__________________________________________________________________________________________________
input_segments (InputLayer)     [(None, 512)]        0
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]
                                                                 input_masks[0][0]
                                                                 input_segments[0][0]
__________________________________________________________________________________________________
dense_output (Dense)            (None, 30)           23070       keras_layer[0][0]
==================================================================================================
Total params: 109,505,311
Trainable params: 109,505,310
Non-trainable params: 1
__________________________________________________________________________________________________
None
Train on 1000 samples
Epoch 1/100
2019-12-26 08:54:44.071737: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:54:45.962313: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:54:57.818644: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
 900/1000 [==========================>...] - ETA: 8s - loss: 0.2933

nvidia-smi shows that the process occupies 10765MiB on GPU 1:

Every 0.5s: nvidia-smi                                                                                                                                                          Thu Dec 26 08:56:22 2019

Thu Dec 26 08:56:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 46%   77C    P2    82W / 250W |  10895MiB / 11178MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 58%   86C    P2   195W / 250W |  10765MiB / 11178MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 88%   86C    P2   150W / 250W |   5930MiB / 11178MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   38C    P8     9W / 250W |    805MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25551      C   python                                     10885MiB |
|    1     24838      C   python                                     10755MiB |
|    2      8663      C   python                                       395MiB |
|    2     28173      C   python                                      5525MiB |
|    3     15501      C   python                                       795MiB |
+-----------------------------------------------------------------------------+

No matter what max_seq_len I use, the result is the same: GPU memory usage does not change. For example, here is the output with max_seq_len=64.

Model summary and training log:


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_ids (InputLayer)          [(None, 64)]         0
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 64)]         0
__________________________________________________________________________________________________
input_segments (InputLayer)     [(None, 64)]         0
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]
                                                                 input_masks[0][0]
                                                                 input_segments[0][0]
__________________________________________________________________________________________________
dense_output (Dense)            (None, 30)           23070       keras_layer[0][0]
==================================================================================================
Total params: 109,505,311
Trainable params: 109,505,310
Non-trainable params: 1
__________________________________________________________________________________________________
None
Train on 1000 samples
Epoch 1/100
2019-12-26 08:58:01.458129: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:58:03.176888: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:58:14.005948: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
1000/1000 [==============================] - 29s 29ms/sample - loss: 0.3040
Epoch 2/100
 280/1000 [=======>......................] - ETA: 6s - loss: 0.1366

And the GPU usage:

Every 0.5s: nvidia-smi                                                                                                                                                          Thu Dec 26 08:59:10 2019

Thu Dec 26 08:59:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 46%   78C    P2   277W / 250W |  10895MiB / 11178MiB |     36%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 75%   86C    P2   222W / 250W |  10765MiB / 11178MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 88%   88C    P2   175W / 250W |   5930MiB / 11178MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   39C    P8     9W / 250W |    805MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25551      C   python                                     10885MiB |
|    1     29332      C   python                                     10755MiB |
|    2      8663      C   python                                       395MiB |
|    2     28173      C   python                                      5525MiB |
|    3     15501      C   python                                       795MiB |
+-----------------------------------------------------------------------------+

Training is indeed faster with a smaller max_seq_len, but I care more about memory usage. Can anyone help? Any suggestions would be appreciated!

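I solved the problem with the following code from the TensorFlow documentation. By default TensorFlow reserves nearly all of the memory on every visible GPU, so nvidia-smi shows that reservation rather than what the model actually needs. Enabling memory growth makes TensorFlow allocate GPU memory only as it is needed.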

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
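
The TensorFlow GPU guide also describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, as another way to request on-demand allocation. The sketch below is one possible way to set it from Python. It assumes the variable is set before TensorFlow initializes its GPU devices, and the helper name enable_gpu_memory_growth_via_env is made up for this example.

```python
import os


def enable_gpu_memory_growth_via_env() -> None:
    """Ask TensorFlow to allocate GPU memory on demand (hypothetical helper).

    Assumption: this runs before TensorFlow initializes its GPU devices,
    because the variable is read at that point.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


enable_gpu_memory_growth_via_env()
# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

The same effect can be had without touching the script by exporting the variable in the shell before starting Python, alongside CUDA_VISIBLE_DEVICES.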