ML Tuning - Cross Validation in Spark

I am looking at the cross-validation code example at https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

It says:

CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.

So I don't understand why the code splits the data into training and test:

// Imports needed for the pattern match below.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Is it possible to apply cross-validation and get predictions without splitting the data?

The data is split into training and test to prevent the resulting model from being evaluated on the same data that was used to tune its hyperparameters. Evaluating a model on the data it was trained on gives an overly optimistic estimate of its performance.
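As a minimal sketch of this hold-out idea in plain Scala (using a hypothetical vector of row ids in place of a DataFrame): the model is fit on one portion and evaluated on a disjoint portion, so the evaluation never sees training rows. In Spark itself you would use `DataFrame.randomSplit(Array(0.8, 0.2))` instead.

```scala
object HoldOutSketch extends App {
  // Hypothetical labeled examples, standing in for DataFrame rows.
  val data = (0 until 10).toVector

  // Deterministic 80/20 split for illustration; Spark's
  // randomSplit would shuffle and split probabilistically.
  val (training, test) = data.splitAt((data.size * 0.8).toInt)

  // The two sets are disjoint, so metrics computed on `test`
  // are not inflated by rows the model was fit on.
  assert(training.intersect(test).isEmpty)
  println(s"training=$training test=$test")
}
```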

It may help to think of test as a "validation" dataset, since training is itself split into 2/3 training data and 1/3 test data within each fold.
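The fold mechanics from the docs quote can be sketched in plain Scala (hypothetical row ids standing in for the `training` DataFrame; this is an illustration, not CrossValidator's internal code). With k=3, each fold serves once as the held-out part while the remaining 2/3 is used for fitting, and all of this happens inside `training` only:

```scala
object FoldSketch extends App {
  // Hypothetical row ids standing in for the `training` DataFrame.
  val training = (0 until 9).toVector
  val k = 3

  // Slice the data into k folds of equal size.
  val folds = training.grouped(training.size / k).toVector

  // Each fold is held out once; the rest form the fit set.
  val pairs = folds.indices.map { i =>
    val heldOut = folds(i)
    val rest = folds.indices.filter(_ != i).flatMap(folds).toVector
    (rest, heldOut)
  }

  pairs.foreach { case (tr, te) => println(s"train=$tr test=$te") }
}
```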

Here is a good explanation on nested cross-validation.

See also this question for a better explanation of why splitting the data into 3 sets can make sense.
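The three-set idea can be sketched in plain Scala under hypothetical 60/20/20 proportions (in Spark you could use `df.randomSplit(Array(0.6, 0.2, 0.2))`): fit on train, pick hyperparameters on validation, and reserve test for one final, unbiased performance estimate.

```scala
object ThreeWaySketch extends App {
  // Hypothetical row ids standing in for DataFrame rows.
  val data = (0 until 10).toVector

  // Deterministic 60/20/20 split for illustration.
  val (train, rest) = data.splitAt(6)
  val (validation, test) = rest.splitAt(2)

  // train: fit models; validation: tune hyperparameters;
  // test: touched once at the end for the final estimate.
  println(s"train=$train validation=$validation test=$test")
}
```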