H2O Sparkling Water - DNN mini_batch_size parameter

I am currently running Spark 2.3.0 with sparkling-water 2.3.1. I found the documentation of the underlying H2O library by looking at the changelog that links to this. So apparently it uses H2O 3.18.

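Looking at the DNN, I noticed that there is no batch_size parameter. Instead it offers a mini_batch_size parameter, which is not actually documented. The only documentation I found for this parameter is here, and it refers to H2O 2.4. I am assuming it still applies to the version I am using (I don't know whether this assumption is correct).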

mini batch

The number of training data rows to be processed per iteration. Note that independent of this parameter, each row is used immediately to update the model with (online) stochastic gradient descent. The mini batch size controls the synchronization period between nodes in a distributed environment and the frequency at which scoring and model cancellation can happen. For example, if mini-batch is set to 10,000 on H2O running on 4 nodes, then each node will process 2,500 rows per iteration, sampling randomly from their local data. Then, model averaging between the nodes takes place, and scoring can happen (dependent on scoring interval and duty factor). Special values are 0 for one epoch per iteration and -1 for processing the maximum amount of data per iteration. If “replicate training data” is enabled, N epochs will be trained per iteration on N nodes, otherwise one epoch.
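The node arithmetic in the quoted passage can be checked with a small sketch. It only mirrors the example from the quote (10,000 rows per iteration on 4 nodes) and assumes the rows are split evenly across nodes. The class and method names are made up for illustration and are not part of H2O.

```java
final class MiniBatchSplit {
    // Rows each node processes per iteration, assuming the per-iteration
    // row count is divided evenly across nodes (as in the quoted example).
    static long rowsPerNode(long rowsPerIteration, int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return rowsPerIteration / nodes;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerNode(10_000, 4)); // prints 2500
    }
}
```

Note that in this description the value is a synchronization interval between nodes, not the number of rows averaged into a single gradient step.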

From this I conclude that the batch size is effectively fixed at 1, since it always performs online gradient descent.

I also started digging through the H2O source code to see what the default value is, and AFAIU the default parameters are contained in this class.

From line 1694:

// stochastic gradient descent: mini-batch size = 1
// batch gradient descent: mini-batch size = # training rows
public int _mini_batch_size = 1;

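So judging from the comments, it does not actually perform online gradient descent; the parameter appears to behave as a real batch size. And if we assume the H2O 2.4 documentation still applies, a value of 1 makes no sense.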
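To make the two readings in those source comments concrete, here is a minimal sketch of what "batch size" means in the usual gradient-descent sense: 1 gives one weight update per row (online SGD), and the number of training rows gives one update per pass (batch gradient descent). This is a generic illustration on a toy one-weight model, not H2O's implementation; all names are hypothetical.

```java
final class BatchSizeDemo {
    // Number of weight updates in one pass over the data for a given batch size.
    static int updatesPerEpoch(int nRows, int batchSize) {
        return (nRows + batchSize - 1) / batchSize; // ceiling division
    }

    // One epoch of mini-batch gradient descent for the model y = w * x with
    // squared loss. The gradient is averaged over each batch before the update.
    static double trainEpoch(double[] x, double[] y, double w, double lr, int batchSize) {
        for (int start = 0; start < x.length; start += batchSize) {
            int end = Math.min(start + batchSize, x.length);
            double grad = 0.0;
            for (int i = start; i < end; i++) {
                grad += 2.0 * (w * x[i] - y[i]) * x[i];
            }
            w -= lr * grad / (end - start);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8}; // generated with w = 2

        // Batch size 1: one update per row. Batch size = #rows: one update per epoch.
        System.out.println(updatesPerEpoch(x.length, 1));
        System.out.println(updatesPerEpoch(x.length, x.length));

        // The weight after one epoch differs depending on the batch size.
        System.out.println(trainEpoch(x, y, 0.0, 0.01, 1));
        System.out.println(trainEpoch(x, y, 0.0, 0.01, x.length));
    }
}
```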

Further on, at line 2173, where the user-supplied parameters are applied:

if (fromParms._mini_batch_size > 1) {
    Log.warn("_mini_batch_size", "Only mini-batch size = 1 is supported right now.");
    toParms._mini_batch_size = 1;
}
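The practical effect of that check can be sketched in isolation. This is a hypothetical stand-in that only mirrors the quoted condition and warning text; it does not use H2O's classes or its `Log` utility, and the method name is my own.

```java
final class MiniBatchClamp {
    // Returns the value that would actually be used: anything greater than 1
    // is reset to 1 with a warning, mirroring the check quoted above.
    // Values <= 1 pass through unchanged, because the quoted check only tests "> 1".
    static int effectiveMiniBatchSize(int requested) {
        if (requested > 1) {
            System.err.println("Only mini-batch size = 1 is supported right now.");
            return 1;
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMiniBatchSize(32)); // prints 1 (after the warning)
        System.out.println(effectiveMiniBatchSize(1));  // prints 1
    }
}
```

If this reading is right, any value above 1 passed by the user has no effect other than the warning.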

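Admittedly I only took a quick look at the source code and may be missing something, but I really cannot work out how the mini_batch_size parameter works and how it relates to the batch size. Can someone shed some light on this?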

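This parameter is not actually meant to be used by users, and there is a ticket to hide it here. For now, please leave mini_batch_size at 1 (the default) so that you do not run into any warnings or errors.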