How to calculate optimal batch size
Sometimes I run into the following problem:
OOM when allocating tensor with shape
e.g.
OOM when allocating tensor with shape (1024, 100, 160)
where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in the model, it runs fine.
Is there a generic way to calculate the optimal batch size based on the model and GPU memory, so that the program doesn't crash?
In short: I want the largest batch size for my model that will fit into my GPU memory and won't crash the program.
From the recent Deep Learning book by Goodfellow et al., chapter 8:
Minibatch sizes are generally driven by the following factors:
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
- If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
- Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
In practice this usually means "powers of 2, and the larger the better, provided that the batch fits in your (GPU) memory".
You may also want to consult several good posts here in Stack Exchange:
- Tradeoff batch size vs. number of iterations to train a neural network
- How large should the batch size be for stochastic gradient descent?
Keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respectable researchers in the deep learning community.
Hope this helps...
UPDATE (Dec 2017):
There is a new paper by Yoshua Bengio and his team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical and experimental results on the interplay between learning rate and batch size.
UPDATE (Mar 2021):
Here is another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the bigger-the-better advice; quoting from the abstract:
The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
You can estimate the largest batch size using:
Max batch size = available GPU memory bytes / 4 / (size of tensors + trainable parameters)
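As a rough illustration, here is that formula as a minimal Python sketch; every number below is an assumed example value (an 8 GB card, float32 values, "size of tensors" read as the element count of the per-sample tensor), not a measurement:

gpu_memory_bytes = 8 * 1024**3           # assumed: 8 GB of free GPU memory
trainable_params = 1_000_000             # assumed: from model.count_params()
tensor_elements  = 100 * 160             # assumed: elements in the per-sample tensor

# divide by 4 because each float32 value occupies 4 bytes
max_batch_size = gpu_memory_bytes // 4 // (tensor_elements + trainable_params)
print(max_batch_size)                    # 2113 for these example numbers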
I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session with the following:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
(See the TensorFlow documentation linked in the comment above for details.)
Here is a function to find the batch size for training a model:
def FindBatchSize(model):
    """model: model architecture that is yet to be trained"""
    import os, gc, psutil
    from keras import backend as K

    BatchFound = 16
    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # find whether a GPU is available
        try:
            if K.tensorflow_backend._get_available_gpus() == []:
                GCPU = "CPU"   # CPU and Cuda9 GPU
            else:
                GCPU = "GPU"
        except Exception:
            from tensorflow.python.client import device_lib   # Cuda8 GPU

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            if "gpu" not in str(get_available_gpus()).lower():
                GCPU = "CPU"
            else:
                GCPU = "GPU"

        # decide batch size on the basis of GPU availability and model complexity
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params < 1000000):
            BatchFound = 64
        if (os.cpu_count() < 16) and (total_params < 500000):
            BatchFound = 64
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 1000000) and (total_params < 2000000):
            BatchFound = 32
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 2000000) and (total_params < 10000000):
            BatchFound = 16
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 10000000):
            BatchFound = 8
        if (os.cpu_count() < 16) and (total_params > 5000000):
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # reduce the batch size further if system memory is already heavily used
        memoryused = psutil.virtual_memory().percent
        if memoryused > 75.0:
            BatchFound = 8
        if memoryused > 85.0:
            BatchFound = 4
        if memoryused > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size: " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    return BatchFound
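A quick usage sketch (the model and training arrays here are assumed to exist already; they are not defined in the answer above):

batch_size = FindBatchSize(model)        # heuristic estimate from the function above
model.fit(x_train, y_train, batch_size=batch_size, epochs=10)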
Use the summaries provided by pytorchsummary (pip install) or keras (built in).
E.g.
from torchsummary import summary
summary(model)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------
Each instance you put in the batch requires a full forward/backward pass' worth of memory; the model itself you only need once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.
Don't forget to linearly increase your learning rate when increasing the batch size.
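A minimal sketch of that linear scaling rule (the base values here are arbitrary assumptions):

base_lr, base_batch_size = 0.01, 32                       # assumed starting point
new_batch_size = 256
new_lr = base_lr * (new_batch_size / base_batch_size)     # 8x larger batch -> 8x larger lr = 0.08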
Let's say we have a Tesla P100 at hand with 16 GB of memory.
(16000 - model_size) / (forward_backward_size)
(16000 - 4.3) / 13.93 = 1148.29
rounded down to the nearest power of 2 gives a batch size of 1024
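The same back-of-the-envelope calculation as a small sketch, using the MB figures from the torchsummary output above (treat the result as a rough upper bound rather than an exact limit):

import math

gpu_memory_mb = 16000      # Tesla P100 with 16 GB
params_mb     = 4.30       # "Params size (MB)" from the summary
fwd_bwd_mb    = 13.93      # "Forward/backward pass size (MB)" per sample

estimate   = (gpu_memory_mb - params_mb) / fwd_bwd_mb   # ~1148.29
batch_size = 2 ** int(math.log2(estimate))              # round down to a power of 2 -> 1024
print(batch_size)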