Cross-validation across models in h2o in R

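I plan to run glm, lasso, and randomForest across different sets of predictors to see which combination of model and predictors performs best. I will be using v-fold cross-validation. To compare the ML algorithms consistently, the same folds have to be fed into each algorithm. Please correct me if I am wrong here.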

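How can we achieve this in the h2o package in R? Should I set

fold_assignment = "Modulo" in each algorithm function, such as h2o.glm(), h2o.randomForest(), etc.?
Would the training set then be split the same way across the ML algorithms?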

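If I use fold_assignment = "Modulo", what happens if I also have to stratify on my outcome? Stratification is selected through the same fold_assignment parameter, so I am not sure I can specify Modulo and Stratified at the same time.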

Alternatively, if I set the same seed in each model, would they all get the same folds as input?

I have the above questions after reading Chapter 4 of [Practical Machine Learning with H2O by Darren Cook](https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).

Further, regarding generalizability with site-level data, in a scenario like the one in the quotation below:

For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)

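In that case, since we are defining the cross-validation folds by a column, the folds would be consistent across all the different models, right?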

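Make sure you split the dataset the same way for all ML algorithms (same seed). Giving each model the same seed will not necessarily produce the same cross-validation fold assignments. To make sure the comparison is apples-to-apples, create a fold column (.kfold_column() or .stratified_kfold_column(); in R these are h2o.kfold_column() and h2o.stratified_kfold_column()) and specify it during training via fold_column, so that every model uses the same fold assignment.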
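
A minimal sketch of the shared fold column approach in R might look like the following. It is only an illustration under stated assumptions: iris stands in for the real training data, the column name "fold", the seed, and the model settings (including alpha = 1 with lambda_search as the lasso) are arbitrary choices rather than anything from the question, and a local H2O cluster is assumed to be available.

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started

# Toy stand-in for the real training data (assumption: iris, multiclass outcome)
train <- as.h2o(iris)
y <- "Species"
x <- setdiff(names(iris), y)

# Build ONE fold assignment, stratified on the outcome, and attach it to the frame.
# The column name "fold" is an arbitrary choice for this sketch.
folds <- h2o.stratified_kfold_column(train[, y], nfolds = 5, seed = 42)
names(folds) <- "fold"
train <- h2o.cbind(train, folds)

# Every model receives fold_column instead of nfolds / fold_assignment,
# so all of them are cross-validated on exactly the same row-to-fold mapping.
glm_fit <- h2o.glm(x = x, y = y, training_frame = train,
                   family = "multinomial", fold_column = "fold")

lasso_fit <- h2o.glm(x = x, y = y, training_frame = train,
                     family = "multinomial", alpha = 1, lambda_search = TRUE,
                     fold_column = "fold")

rf_fit <- h2o.randomForest(x = x, y = y, training_frame = train,
                           ntrees = 50, seed = 42, fold_column = "fold")

# Compare cross-validated log loss across the three models
models <- list(glm = glm_fit, lasso = lasso_fit, rf = rf_fit)
cv_logloss <- sapply(models, h2o.logloss, xval = TRUE)
print(cv_logloss)
```

Because the folds live in the data rather than in each algorithm's settings, this also covers the stratification concern: the column is stratified once, then reused everywhere. For the site-level case in the quotation, the same pattern should apply with the existing site column (for example a city column) passed as fold_column in place of a generated one.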