Will a larger batch size reduce computation time in machine learning?

I am trying to tune a hyperparameter, namely the batch size, of a CNN. I have a Core i7 computer with 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in a blog.

First, here is what I read and learned about batch size in machine learning:

let's first suppose that we're doing online learning, i.e. that we're using a mini-batch size of 1. The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.

Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. As we know, we can use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of our hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size 100, rather than computing the mini-batch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long. Now, at first it seems as though this doesn't help us that much.

With our mini-batch of size 100 the learning rule for the weights looks like:

$$w \rightarrow w' = w - \frac{\eta}{100} \sum_x \nabla C_x,$$

where the sum is over training examples in the mini-batch. This is versus

$$w \rightarrow w' = w - \eta \nabla C_x$$

for online learning. Even if it only takes 50 times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor 100, so the update rule becomes

$$w \rightarrow w' = w - \eta \sum_x \nabla C_x.$$

That's a lot like doing 100 separate instances of online learning with a learning rate of η. But it only takes 50 times as long as doing a single instance of online learning. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
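To make the timing argument in that quote concrete, here is a minimal sketch (my own illustration in NumPy with a toy logistic-regression model, not code from the blog or the book) comparing a mini-batch gradient computed by looping over examples with the same gradient computed in one vectorized matrix operation, followed by the scaled-learning-rate update:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_features, batch_size = 3072, 100          # e.g. flattened 32x32x3 CIFAR images
X = rng.normal(size=(batch_size, n_features))
y = rng.integers(0, 2, size=batch_size)
w = np.zeros(n_features)

def grad_single(x_i, y_i, w):
    """Gradient of the logistic loss for one training example."""
    p = 1.0 / (1.0 + np.exp(-x_i @ w))
    return (p - y_i) * x_i

# 1) Loop over the 100 examples one at a time (like 100 online-learning steps).
t0 = time.perf_counter()
g_loop = np.mean([grad_single(X[i], y[i], w) for i in range(batch_size)], axis=0)
t_loop = time.perf_counter() - t0

# 2) One vectorized matrix computation for the whole mini-batch.
t0 = time.perf_counter()
p = 1.0 / (1.0 + np.exp(-X @ w))
g_vec = (X.T @ (p - y)) / batch_size
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.6f}s  vectorized: {t_vec:.6f}s  same gradient: {np.allclose(g_loop, g_vec)}")

# Mini-batch update with the learning rate scaled up by the batch size, as in the quote:
eta = 0.01
w = w - (eta * batch_size) * g_vec          # equivalent to w - eta * sum_x grad_x
```

On most BLAS-backed NumPy installs the vectorized version is much faster than the loop, which is exactly why a larger mini-batch can be cheaper per example.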



Now I tried the MNIST digit dataset: I ran a sample program, first with a batch size of 1, and noted the training time for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and github link, changing the batch size does not reduce the training time. It stays the same whether I use 30, 128, or 64. They say they got 92% accuracy, and that after two or three epochs they were already above 40% accuracy. But when I ran the code on my computer, changing nothing except the batch size, I got worse results: only 28% after 10 epochs, and the test accuracy stayed stuck there in the following epochs. Then I thought that, since they used a batch size of 128, I should use that too. But with the same setting it got even worse, giving only 11% after 10 epochs and staying stuck there. Why?
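For reference, here is a minimal way to measure what is described above: wall-clock training time per epoch for several batch sizes on MNIST. This is a sketch with a deliberately tiny stand-in model (not the CNN from the linked blog or GitHub code), just to show the measurement itself:

```python
import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train, y_train = x_train[:10000], y_train[:10000]   # small subset so batch_size=1 stays tractable

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for batch_size in [1, 30, 64, 128]:
    model = make_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    start = time.perf_counter()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    print(f"batch_size={batch_size:4d}  one epoch: {time.perf_counter() - start:.1f}s")
```

On a CPU-only machine the time per epoch typically drops sharply between batch size 1 and 30 and then flattens out, once the matrix multiplications already saturate the cores; that would be consistent with the behaviour described above.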

A neural network learns its weights by gradient descent on an error function in weight space, which is parameterized by the training examples. This means the variables are the weights of the neural network. The function is "generic" and only becomes concrete when you plug in training examples. The "correct" way would be to use all training examples to make it concrete. This is called "batch gradient descent" and it is usually not done, for two reasons:

  1. It might not fit into your RAM (usually your GPU's memory, since for neural networks you get a huge boost when you use a GPU).
  2. It is actually not necessary to use all examples.

In machine learning problems you usually have many thousands of training examples. However, the error surface may look similar when you look at only a few of them (e.g. 64, 128, or 256).

Think of it like a photo: to get an idea of what the photo shows, you usually don't need a 2500x1800 px resolution. A 256x256 px image gives you a good idea of what the photo is about. However, you miss details.

So think of gradient descent as walking on the error surface: you start at one point and you want to find the lowest one. To do so, you walk downhill. Then you check your height again, check in which direction it goes down, and take a "step" in that direction (whose size is determined by the learning rate and a couple of other factors). When you do mini-batch training instead of batch training, you walk down on a different error surface: the low-resolution one. A step might actually go up on the "real" error surface, but overall you will move in the right direction. And you can make each single step much faster! A small numerical sketch of this idea follows below.
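Here is that sketch (my own illustration, a toy least-squares problem in NumPy, not anything from the original post): each step computes the gradient on a random mini-batch only, compares its direction with the full-batch gradient, and still drives the full-data loss down:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_grad(w):
    # Gradient of the mean squared error over *all* examples ("high-resolution" surface).
    return 2 * X.T @ (X @ w - y) / n

w, eta, batch_size = np.zeros(d), 0.05, 128
for step in range(200):
    idx = rng.choice(n, size=batch_size, replace=False)    # random mini-batch
    Xb, yb = X[idx], y[idx]
    g_mini = 2 * Xb.T @ (Xb @ w - yb) / batch_size         # "low-resolution" gradient
    if step % 50 == 0:
        g_full = full_grad(w)
        cos = g_mini @ g_full / (np.linalg.norm(g_mini) * np.linalg.norm(g_full))
        loss = np.mean((X @ w - y) ** 2)
        print(f"step {step:3d}  full-data loss {loss:8.4f}  cos(mini, full) {cos:.3f}")
    w -= eta * g_mini                                       # step on the mini-batch surface
```

The mini-batch gradient rarely points exactly where the full-batch gradient does, and an individual step can even increase the full-data loss, but on average the direction is right, which is the compass analogy from the quote above.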

Now, what happens when you lower the resolution (i.e., make the batch size smaller)?

Right: your view of what the error surface looks like becomes less accurate. How much this affects you depends on factors such as:

  • Your hardware/implementation
  • Dataset: How complex is the error surface, and how well is it approximated by only a small portion of it?
  • Learning: How exactly are you learning (momentum? newbob? rprop?)

I would like to add to what has already been said here that a larger batch size is not always good for generalization. I have seen such cases myself, where increasing the batch size hurt validation accuracy, particularly for a CNN trained on the CIFAR-10 dataset.

From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32–512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

Bottom line: you should tune the batch size, just like any other hyperparameter, to find the value that works best.
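As a sketch of what "tuning it like any other hyperparameter" can look like (again with a toy stand-in model rather than the CIFAR-10 CNN discussed above), a simple sweep over candidate batch sizes evaluated on a held-out validation split:

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

results = {}
for batch_size in [32, 64, 128, 256]:
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        batch_size=batch_size, epochs=3,
                        validation_split=0.1, verbose=0)
    results[batch_size] = history.history["val_accuracy"][-1]

print(results)
print("best batch size on this run:", max(results, key=results.get))
```

A fair comparison would also re-tune the learning rate for each batch size, since the two interact.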

A 2018 opinion, retweeted by Yann LeCun, is the paper Revisiting Small Batch Training For Deep Neural Networks by Dominic Masters and Carlo Luschi, which suggests that a good generic maximum batch size is:

32

with some interplay with the choice of learning rate.
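One common heuristic for that interplay (not something the paper itself prescribes, but in line with the factor-of-100 argument quoted from Nielsen earlier) is to scale the learning rate roughly linearly with the batch size relative to some reference setting; the reference numbers below are purely illustrative:

```python
# Assumed reference setting; these two numbers are illustrative, not from the paper.
base_batch_size = 32
base_learning_rate = 0.01

def scaled_learning_rate(batch_size: int) -> float:
    """Linear scaling heuristic: keep the effective per-example update roughly constant."""
    return base_learning_rate * batch_size / base_batch_size

for bs in [32, 64, 128, 256]:
    print(bs, scaled_learning_rate(bs))
```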

The earlier 2016 paper On Large-Batch Training For Deep Learning: Generalization Gap And Sharp Minima gives some reasons for not using large batches, which I paraphrase badly as: large batches are likely to get stuck in local ("sharp") minima, while small batches are not.