XGBoost doesn't generate as many trees as specified in the num_round parameter

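This is not a bug report, more a question of understanding. When I call getModelDump on the Booster object, I do not get as many trees as specified in the "num_round" parameter. I assumed that if "num_round" is 100, XGBoost would generate 100 trees sequentially and I would see all of them when calling getModelDump. I'm sure there is a logical reason behind this, or my understanding is wrong. Could you explain what is happening?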

    import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostEstimator}
    import org.apache.spark.ml.Pipeline

    val paramMap = List(
      "eta" -> 0.1, "max_depth" -> 7, "objective" -> "binary:logistic", "num_round" -> 100,
      "eval_metric" -> "auc", "nworkers" -> 8).toMap
    val xgboostEstimator = new XGBoostEstimator(paramMap)
    // trainModel is a set of standard Spark feature stages such as StringIndexer, OneHotEncoder and VectorAssembler
    val pipelineXGBoost = new Pipeline().setStages(Array(trainModel, xgboostEstimator))
    val cvModel = pipelineXGBoost.fit(train)
    // The call below prints only 2 trees instead of 100, even though num_round is 100!
    println(cvModel.stages(1).asInstanceOf[XGBoostClassificationModel].booster.getModelDump()(0))

GitHub link to the issue: https://github.com/dmlc/xgboost/issues/2610

The versions I am using, with Scala 2.11, are as follows:

  "ml.dmlc" % "xgboost4j" % "0.7",
  "ml.dmlc" % "xgboost4j-spark" % "0.7",
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "org.apache.spark" %% "spark-graphx" % "2.2.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0",

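I was not reading the indices (0 until num_round) of the getModelDump result. Each index of the returned array corresponds to a separate tree, and my code was printing only index 0.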
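To illustrate, here is a minimal sketch of walking over every entry of the dump rather than only the first. It assumes only that getModelDump() returns an Array[String] with one entry per tree. The object name ModelDumpExample, the helper indexTrees, and the two dump strings are hypothetical stand-ins so the snippet runs without a trained model; with a real model you would pass booster.getModelDump() instead.

```scala
object ModelDumpExample {
  // Pairs each dumped tree with its position in the array.
  // (Hypothetical helper, not part of the XGBoost API.)
  def indexTrees(dump: Array[String]): Seq[(Int, String)] =
    dump.zipWithIndex.map { case (tree, i) => (i, tree) }.toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: made-up stand-in for booster.getModelDump().
    // Real dumps contain one string per boosting round.
    val dump = Array(
      "0:[f0<0.5] yes=1,no=2,missing=1\n\t1:leaf=0.1\n\t2:leaf=-0.1\n",
      "0:[f1<1.5] yes=1,no=2,missing=1\n\t1:leaf=0.05\n\t2:leaf=-0.05\n")

    // The array length is the number of trees in the dump.
    println(s"Number of trees: ${dump.length}")

    // Print every tree, not just dump(0).
    indexTrees(dump).foreach { case (i, tree) =>
      println(s"booster[$i]:\n$tree")
    }
  }
}
```

With the pipeline from the question, checking the length of the array returned by getModelDump() should be a quick way to confirm how many trees the model actually holds.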

Answered in the GitHub issue: https://github.com/dmlc/xgboost/issues/2610