How to apply several Indexers and Encoders without creating countless intermediate DataFrames?
Here is my code:
val workindexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
val workencoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
val educationindexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
val educationencoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
val maritalindexer = new StringIndexer().setInputCol("marital_status").setOutputCol("maritalIndex")
val maritalencoder = new OneHotEncoder().setInputCol("maritalIndex").setOutputCol("maritalVec")
val occupationindexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
val occupationencoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
val relationindexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
val relationencoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
val raceindexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
val raceencoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
val sexindexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val sexencoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
val nativeindexer = new StringIndexer().setInputCol("native_country").setOutputCol("native_countryIndex")
val nativeencoder = new OneHotEncoder().setInputCol("native_countryIndex").setOutputCol("native_countryVec")
val labelindexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
Is there a way to apply all of these encoders and indexers without creating countless intermediate DataFrames?
I would use RFormula:
import org.apache.spark.ml.feature.RFormula
val features = Seq("workclass", "education",
  "marital_status", "occupation", "relationship",
  "race", "sex", "native_country")
val formula = new RFormula().setFormula(s"label ~ ${features.mkString(" + ")}")
It will apply the same transformations as the indexers used in your example and assemble the feature Vector.
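
For context, a minimal sketch of how the formula above might be applied, assuming a source DataFrame named `df` (a hypothetical name) that contains the census columns listed in the question plus the "label" column:

// Fitting learns the category-to-index mappings; transform then one-hot encodes
// the categorical columns and assembles them into a single "features" vector column.
val prepared = formula.fit(df).transform(df)
prepared.select("features").show(5)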
Alternatively, use the Spark MLlib feature called ML Pipelines:
ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.
With ML Pipelines you can "chain" (or "pipe") the encoders and indexers without creating countless intermediate DataFrames:
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(workindexer, workencoder, ...))
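
To make the stage list concrete, here is a minimal sketch using the indexers and encoders defined in the question and a source DataFrame named `df` (a hypothetical name); the trailing "..." above stands for the remaining stages:

import org.apache.spark.ml.{Pipeline, PipelineStage}

// Every stage from the question; each OneHotEncoder comes right after the StringIndexer
// that produces its input column, so the Pipeline can run them in order.
val stages: Array[PipelineStage] = Array(
  workindexer, workencoder, educationindexer, educationencoder,
  maritalindexer, maritalencoder, occupationindexer, occupationencoder,
  relationindexer, relationencoder, raceindexer, raceencoder,
  sexindexer, sexencoder, nativeindexer, nativeencoder, labelindexer)

// A single fit/transform yields one DataFrame with all the *Index and *Vec columns.
val transformed = new Pipeline().setStages(stages).fit(df).transform(df)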