Selection of Mini-batch Size for Neural Network Regression
I am doing a neural network regression with 4 features. How should I determine the mini-batch size for my problem? I see people using batch sizes of 100–1000 for computer vision, where each image has 32*32*3 features; does that mean I should use a batch size of about one million? I have billions of data points and tens of GB of memory, so there is no hard constraint preventing me from doing so.
I have also observed that training with mini-batches of size ~1000 converges much faster than with a batch size of one million. I would expect the opposite, since the gradient computed from a larger batch should be more representative of the gradient over the whole training set. Why does training with mini-batches converge faster?
From Tradeoff batch size vs. number of iterations to train a neural network:
From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, https://arxiv.org/abs/1609.04836 :
The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.
[…]
The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.
[…]
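The paper characterizes sharpness through the eigenvalues of $\nabla^2 f(x)$: sharp minima have large positive eigenvalues, flat minima small ones. As a rough illustration only (not the sensitivity measure used in the paper), here is a hypothetical sketch that estimates the largest Hessian eigenvalue of a training loss by power iteration on Hessian-vector products with PyTorch autograd; the model, data, and iteration counts are placeholders.

```python
import torch

# Hypothetical sketch: estimate the largest eigenvalue of the loss Hessian
# (a proxy for "sharpness") via power iteration on Hessian-vector products.
# Model, data, and iteration counts are illustrative, not from the paper.

torch.manual_seed(0)
X = torch.randn(256, 4)                     # 4-feature regression inputs
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]) + 0.1 * torch.randn(256)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
)
params = [p for p in model.parameters() if p.requires_grad]

def loss_fn():
    return torch.nn.functional.mse_loss(model(X).squeeze(-1), y)

def hvp(vec):
    """Hessian-vector product via double backprop."""
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * vec).sum()
    hv = torch.autograd.grad(grad_v, params)
    return torch.cat([h.reshape(-1) for h in hv])

n = sum(p.numel() for p in params)
v = torch.randn(n)
v /= v.norm()
for _ in range(50):                          # power iteration
    hv = hvp(v)
    eig = torch.dot(v, hv).item()            # Rayleigh quotient estimate
    v = hv / (hv.norm() + 1e-12)

print(f"estimated largest Hessian eigenvalue (sharpness proxy): {eig:.4f}")
```

Running such a probe at the end of large-batch versus small-batch training is one informal way to compare how sharp the two solutions are.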
Also, some good insight from Ian Goodfellow answering "Why not use the entire training set to compute the gradient?" on Quora:
The size of the learning rate is limited mostly by factors like how curved the cost function is. You can think of gradient descent as making a linear approximation to the cost function, then moving downhill along that approximate cost. If the cost function is highly non-linear (highly curved) then the approximation will not be very good for very far, so only small step sizes are safe. You can read more about this in Chapter 4 of the deep learning textbook, on numerical computation: http://www.deeplearningbook.org/contents/numerical.html
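To make the curvature point concrete, here is a small sketch of my own (not from Goodfellow's answer): on a quadratic cost $f(x) = \tfrac{1}{2} a x^2$ with curvature $a$, plain gradient descent converges only if the learning rate is below $2/a$, so higher curvature forces smaller steps.

```python
import numpy as np

# Illustrative sketch: gradient descent on f(x) = 0.5 * a * x^2 updates
#   x <- x - lr * a * x
# and converges only when lr < 2 / a, so curvature caps the safe step size.

def run_gd(curvature, lr, steps=20, x0=1.0):
    x = x0
    for _ in range(steps):
        x = x - lr * curvature * x      # gradient of f is curvature * x
    return abs(x)

for curvature in (1.0, 10.0, 100.0):
    for lr in (0.15, 0.015):
        final = run_gd(curvature, lr)
        status = "converging" if final < 1.0 else "diverging"
        print(f"curvature={curvature:6.1f} lr={lr:<6} |x_final|={final:.3e} ({status})")
```

With curvature 100 the larger step diverges while the smaller one still converges, which is the sense in which curvature, not batch size, caps the learning rate.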
When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html
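The O(sqrt(m)) claim is easy to check empirically. The sketch below (my own illustration, with an arbitrary synthetic 4-feature regression) measures the spread of mini-batch gradient estimates for several batch sizes; the product of that spread with sqrt(m) stays roughly constant, which is exactly the diminishing-returns effect described above.

```python
import numpy as np

# Illustrative sketch: the spread of mini-batch gradient estimates around the
# full-data gradient shrinks roughly like 1/sqrt(m), so each 10x increase in
# batch size buys only ~3x less gradient noise at 10x the compute.

rng = np.random.default_rng(0)
N, d = 1_000_000, 4                      # "billions" scaled down for the demo
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)
w = np.zeros(d)                          # evaluate gradients at an arbitrary point

def batch_grad(idx):
    """Gradient of mean squared error over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

for m in (10, 100, 1_000, 10_000, 100_000):
    grads = np.stack([batch_grad(rng.integers(0, N, size=m)) for _ in range(200)])
    spread = grads.std(axis=0).mean()    # average per-coordinate std across batches
    print(f"batch size {m:>7}: gradient std ~ {spread:.4f}  "
          f"(std * sqrt(m) = {spread * np.sqrt(m):.3f})")
```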
Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
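To tie this back to the observation in the question, here is a hypothetical sketch (dataset sizes and learning rate are arbitrary choices for the demo, not recommendations) comparing mini-batch SGD against full-batch gradient descent at equal computation per epoch on a small 4-feature regression: the many noisy small-batch steps typically drive the loss down far faster than the few exact full-batch steps.

```python
import numpy as np

# Illustrative sketch: at equal compute per epoch, mini-batch SGD takes many
# cheap, noisy steps while full-batch gradient descent takes one exact step,
# so per epoch the mini-batch run usually makes much more progress.

rng = np.random.default_rng(1)
N, d = 200_000, 4
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

def mse(w):
    return np.mean((X @ w - y) ** 2)

def train(batch_size, lr=0.05, epochs=3):
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return mse(w)

print(f"mini-batch 1000: final MSE {train(batch_size=1_000):.5f}")
print(f"full batch     : final MSE {train(batch_size=N):.5f}")
```

This matches the behaviour reported above: a batch of ~1000 converges much faster per pass over the data than a batch of a million, even though each individual gradient is noisier.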