Keras MultiGPU training fails with error message, "IndexError: pop from empty list"
I want to train my Keras/TensorFlow model on multiple GPUs using tf.distribute.MirroredStrategy(). Below is a snippet of my code:
# Imports
import tensorflow as tf
import models  # Module of functions for building the model

# Check GPU availability
devices = tf.config.list_physical_devices('GPU')
print('Num GPUs:', len(devices))
print(devices)

# Prepare dataset (Xtrain/Xtest are NumPy arrays with shape (None, 600, 23))
Xtrain, Xtest = models.get_dataset()

# Datasets as tf.data.Dataset objects
batch_size = 256
train_dataset = tf.data.Dataset.from_tensor_slices((Xtrain, Xtrain)).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((Xtest, Xtest)).batch(batch_size)

# Build model for synchronous multi-GPU training
strategy = tf.distribute.MirroredStrategy()
print('Number of devices in strategy: {}'.format(strategy.num_replicas_in_sync))

with strategy.scope():
    # Define model hyperparameters
    input_dim = Xtrain.shape[1:]
    clipnorm = 100
    learning_rate = 1e-4
    latent_dim = 50
    dropout = 0.33

    # Build and compile the model
    encoder = models.Encoder(input_dim=input_dim, latent_dim=latent_dim,
                             dropout=dropout)
    decoder = models.Decoder(input_dim=input_dim, latent_dim=latent_dim,
                             dropout=dropout)
    m1vae = models.ProtVAE(encoder=encoder, decoder=decoder, name='m1vae')
    m1vae.compileVAE(input_dim=input_dim, latent_dim=latent_dim,
                     learning_rate=learning_rate, clipnorm=clipnorm)
When I run the code, it fails at the compile step with the following error message:
Num GPUs: 2
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
Number of devices in strategy: 2
Traceback (most recent call last):
File "work_python_scripts/test_m1vae_gpu.py", line 114, in <module>
m1vae.compileVAE(input_dim=input_dim, latent_dim=latent_dim, learning_rate=learning_rate,
File "/home/jgado/condaenvs/tfgpu/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 332, in __ex\
it__
_pop_per_thread_mode()
File "/home/jgado/condaenvs/tfgpu/lib/python3.8/site-packages/tensorflow/python/distribute/distribution_strategy_context.py", li\
ne 65, in _pop_per_thread_mode
ops.get_default_graph()._distribution_strategy_stack.pop(-1) # pylint: disable=protected-access
IndexError: pop from empty list
I wonder whether this is because my functions (Encoder, Decoder, ProtVAE, and compileVAE) are defined in a separate module (models.py). But I don't think that should be the problem, since these functions are all called inside the strategy.scope() block.
Check your module (models.py) and comment out any session-clearing calls, e.g. K.clear_session(). Clearing the session replaces the default graph with a fresh one whose _distribution_strategy_stack is empty, so when a strategy scope exits (here, inside compileVAE) and tries to pop that stack, you get exactly the "IndexError: pop from empty list" shown in the traceback.
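For illustration, here is a minimal sketch of what the fix might look like inside models.py. The real module isn't shown in the question, so the class body, the Adam optimizer, and the 'mse' loss below are assumptions; the only point is where the clear_session() call has to be removed:

# Hypothetical sketch of the relevant part of models.py (not the real module)
import tensorflow as tf
from tensorflow.keras import backend as K

class ProtVAE(tf.keras.Model):
    def __init__(self, encoder, decoder, name='vae', **kwargs):
        super().__init__(name=name, **kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def call(self, inputs):
        # Reconstruct the input through the encoder/decoder pair
        return self.decoder(self.encoder(inputs))

    def compileVAE(self, input_dim, latent_dim, learning_rate, clipnorm):
        # K.clear_session()  # <-- comment this out: it swaps in a new default
        #                    #     graph, leaving the strategy scope nothing to
        #                    #     pop when it exits
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate,
                                             clipnorm=clipnorm)
        self.compile(optimizer=optimizer, loss='mse')

With the clear_session() call removed (or moved to before strategy.scope() is entered), the compile step inside the scope should complete normally.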