Create a composite Transformer in Spark

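I am using an NGram Transformer followed by a CountVectorizerModel.

I need to be able to create a composite transformer for later reuse.

I was able to achieve this by creating a List<Transformer> and looping over all its elements, but I would like to know whether it is possible to create a single Transformer out of 2 other Transformers.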

This is actually quite easy; you just need to use the Pipeline API to create your pipeline:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample documents: an integer id and a sentence.
List<Row> data = Arrays.asList(
    RowFactory.create(0, "Hi I heard about Spark"),
    RowFactory.create(1, "I wish Java could use case classes"),
    RowFactory.create(2, "Logistic,regression,models,are,neat")
);

StructType schema = new StructType(new StructField[]{
    new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});

// Assumption: an existing SparkSession named `spark` (Spark 2.x or later).
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

Now let's define the pipeline stages (a tokenizer, an n-gram transformer, and a count vectorizer):

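// The input column must match the "sentence" column declared in the schema.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");

// Build bigrams (n = 2) from the tokenized words.
NGram ngramTransformer = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams");

// CountVectorizer is an Estimator; fitting it produces a CountVectorizerModel.
CountVectorizer countVectorizer = new CountVectorizer()
  .setInputCol("ngrams")
  .setOutputCol("feature")
  .setVocabSize(3)
  .setMinDF(2);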

We can now create the pipeline and train it:

Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, ngramTransformer, countVectorizer});

// Fit the pipeline to training documents.
PipelineModel model = pipeline.fit(sentenceDataFrame);
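
To illustrate why this answers the question, here is a minimal plain-Java sketch of the underlying idea: a composite that wraps several stages and is itself a stage. It does not use Spark at all. The names `CompositeSketch`, `Stage`, `compose`, `LOWERCASE`, and `BIGRAMS` are hypothetical and exist only for this illustration; Spark's real Pipeline works on DataFrame columns and also handles fitting estimators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CompositeSketch {
    // Hypothetical stand-in for a transformer: maps a token list to a token list.
    interface Stage extends Function<List<String>, List<String>> {}

    // Builds one composite Stage that applies the given stages in order.
    static Stage compose(List<Stage> stages) {
        return input -> {
            List<String> current = input;
            for (Stage stage : stages) {
                current = stage.apply(current);
            }
            return current;
        };
    }

    // Example stage: lowercases every token.
    static final Stage LOWERCASE = tokens -> {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase());
        }
        return out;
    };

    // Example stage: joins each pair of adjacent tokens with a space.
    static final Stage BIGRAMS = tokens -> {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    };

    public static void main(String[] args) {
        // The composite is used exactly like a single stage.
        Stage composite = compose(Arrays.asList(LOWERCASE, BIGRAMS));
        System.out.println(composite.apply(Arrays.asList("Hi", "I", "heard")));
        // prints [hi i, i heard]
    }
}
```

The caller only ever sees one `Stage`, which is the same convenience the fitted PipelineModel gives you over looping through a List<Transformer> by hand.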

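The fitted PipelineModel is itself a Transformer, so you can reuse it later by calling model.transform(...) on new data. I hope this helps.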