How does Spark CrossValidator determine how to apply the grid parameters

Question

The following snippet from the Spark documentation on ML Tuning (https://spark.apache.org/docs/latest/ml-tuning.html) apparently sets numFeatures for HashingTF (hashing term frequency) and regParam (regularization) for the LogisticRegression model:

HashingTF and LogisticRegression:

val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)

ParamGridBuilder for the CrossValidator:

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

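How does the CrossValidator "know" how to apply the grid values to the respective entities? I looked to see whether it is done via reflection, but that is not clear.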

The methods that are possibly set by the CrossValidator are:

HashingTF:

  /** @group setParam */
  @Since("1.2.0")
  def setNumFeatures(value: Int): this.type = set(numFeatures, value)

LogisticRegression:

class LogisticRegressionModel @Since("1.3.0") (
 ..
 @Since("1.3.0") val numFeatures: Int,

Here is the call to the CrossValidator:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice

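I cannot determine how setEstimatorParamMaps manages to set the HashingTF and LogisticRegression values correctly. (Note that this does work!)

The reason for this question is that I want to add a new Evaluator and am not sure how to make it fit in with the CrossValidator functionality.

A specific example: for LDAModel we have the tuning parameters k, vocabSize, and docConcentration. How should the ParamGrid be set up for those?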

Answer 1

A specific example: for LDAModel: we have tuning parameters k, vocabSize, and docConcentration : how should the ParamGrid be set up for those?

addGrid takes a Param and an Array of compatible values. Typically it is set on the Estimator (LDA), not on the Transformer (LDAModel).

To set k and docConcentration, just follow the types:

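import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.tuning.ParamGridBuilder

val lda = new LDA()

// docConcentration must be a singleton array (replicated to length k) or have length k,
// so the length-3 vector below only fits the k = 3 setting.
val paramGrid = new ParamGridBuilder()
  .addGrid(lda.k, Array(3, 5, 7))
  .addGrid(lda.docConcentration, Array(Array(0.1, 0.4, 0.5)))
  .build()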

we have tuning parameters (...) vocabSize

The vocabulary size is determined by the input vectors. It cannot be tuned.
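Since the vocabulary size is fixed by whatever produced the input vectors, any variation has to happen in that upstream step. As a hedged sketch only: assuming the vectors come from a CountVectorizer placed in the same Pipeline as the LDA stage (the column names "tokens" and "features" and the candidate values are hypothetical), the grid can address the vectorizer's own vocabSize param next to lda.k:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.tuning.ParamGridBuilder

// Assumption: a "tokens" column of Array[String] exists in the input data.
val vectorizer = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
val lda = new LDA().setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(vectorizer, lda))

// vocabSize belongs to the vectorizer stage, k to the LDA stage.
val grid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1000, 5000))
  .addGrid(lda.k, Array(3, 5, 7))
  .build()

println(grid.length) // 2 x 3 combinations
```

The pipeline, rather than the bare LDA estimator, would then be the estimator handed to the grid search.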

How does the CrossValidator "know" how to apply the grid values to the respective entities?

Estimators provide a fit method that takes a dataset and a ParamMap. For example, LDA:

def fit(dataset: Dataset[_], paramMap: ParamMap): LDAModel
Fits a single model to the input data with provided parameter map.
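A ParamMap passed to fit can reach the right stage because each Param records the uid of the object that owns it, and the map is keyed by those Param objects. When a Pipeline copies itself with an extra ParamMap, each stage only picks up the entries for its own params; setters such as setNumFeatures are not looked up by name. A minimal spark-shell sketch of this behaviour (the values 100 and 0.01 are arbitrary, and no SparkSession is needed for it):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.param.ParamMap

val hashingTF = new HashingTF()
val lr = new LogisticRegression()

// Each Param remembers the uid of the stage that owns it.
println(hashingTF.numFeatures.parent == hashingTF.uid)
println(lr.regParam.parent == lr.uid)

// One grid point, built by hand instead of through ParamGridBuilder.
val paramMap = ParamMap(hashingTF.numFeatures -> 100, lr.regParam -> 0.01)

// fit(dataset, paramMap) works on a copy of the estimator with the map applied;
// copying the pipeline copies every stage with the same extra map.
val pipeline = new Pipeline().setStages(Array(hashingTF, lr))
val tuned = pipeline.copy(paramMap)

val tunedTF = tuned.getStages(0).asInstanceOf[HashingTF]
val tunedLR = tuned.getStages(1).asInstanceOf[LogisticRegression]

// Each copied stage took only the value addressed to it.
println(tunedTF.getNumFeatures)
println(tunedLR.getRegParam)
```

The original hashingTF and lr instances are left untouched; only the copies carry the grid values.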

This variant is used by CrossValidator.
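Conceptually, the grid search then amounts to a loop like the one below. This is a simplified sketch, not the actual CrossValidator source: the helper name scoreGrid is made up for illustration, and the real implementation additionally splits the data into folds, averages the metric per ParamMap, and picks the best setting according to the evaluator.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Hypothetical helper: fit one model per ParamMap and score it on held-out data.
def scoreGrid[M <: Model[M]](
    estimator: Estimator[M],
    evaluator: Evaluator,
    grid: Array[ParamMap],
    training: Dataset[_],
    validation: Dataset[_]): Array[(ParamMap, Double)] =
  grid.map { paramMap =>
    // The estimator (possibly a whole Pipeline) applies the map to its own params.
    val model = estimator.fit(training, paramMap)
    // The fitted model transforms the validation data; the evaluator scores the result.
    val metric = evaluator.evaluate(model.transform(validation, paramMap))
    paramMap -> metric
  }
```

Any estimator whose tunable settings are exposed as Param objects can be driven this way, which is why the same grid mechanism covers a Pipeline, LogisticRegression, or LDA.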

scala

apache-spark

apache-spark-mllib