ML Tuning - Cross Validation in Spark
I am looking at the cross-validation code example at https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
It says:
CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
So I don't understand why, in the code, the data is split into training and test:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
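For reference, cv and training come from earlier in the same docs example. A condensed sketch of that setup (essentially the docs' own pipeline; the grid values are just the illustrative ones used there) looks like this:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Labeled training data: (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Pipeline: tokenize -> hash to term-frequency features -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Hyperparameter grid that CrossValidator searches over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// k-fold cross-validation over `training` only; `test` is never seen here.
// The docs example uses 2 folds to keep it small; use 3 or more in practice.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)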
Is it possible to apply cross-validation without splitting the data, and still get predictions?
The data is split into training and test to prevent the performance of the resulting model from being evaluated on the same data that was used to tune the hyperparameters. The point is to avoid evaluating a model on the data it was trained on, because that estimate would be overly optimistic.
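In other words, the usual workflow is to hold the test set out before tuning and touch it only once at the end. A minimal sketch, assuming a hypothetical labeled DataFrame called data (with columns id, text, label; not part of the docs example) and the cv defined above:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hold out a test set *before* any hyperparameter tuning is done.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// CrossValidator only ever sees `training`; it creates its folds internally from it.
val cvModel = cv.fit(training)

// Evaluate exactly once on the untouched test set for an unbiased estimate.
val testMetric = new BinaryClassificationEvaluator().evaluate(cvModel.transform(test))
println(s"held-out test metric (areaUnderROC) = $testMetric")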
Maybe it helps to think of test as a "validation" dataset, since training itself is split into 2/3 training data and 1/3 test data within each of the k=3 folds.
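To make the fold mechanics concrete: the number of folds is just a parameter of CrossValidator, and the splitting happens entirely inside fit(training). A rough sketch, reusing the pipeline and paramGrid names from the setup above; the cross-validated average score per candidate is available afterwards in avgMetrics:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// With 3 folds, each candidate in paramGrid is fit on 2/3 of `training`
// and scored on the remaining 1/3, three times.
val cv3 = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel3 = cv3.fit(training)

// Average evaluator score per paramGrid candidate, computed over the 3 folds.
println(cvModel3.avgMetrics.mkString(", "))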
Here is a good explanation on nested cross-validation.
See also this question for a better explanation of why it can make sense to split the data into 3 sets.