Mixing Spark MLlib and Spark NLP in a pipeline

Question

In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in the same pipeline, I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.

Regards,

Answer 1

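You need to add a Finisher to your Spark NLP pipeline. It converts the annotation structs that Spark NLP annotators produce into plain arrays of strings, which is the type CountVectorizer expects. Try this: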

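import com.johnsnowlabs.nlp.base._       // DocumentAssembler, Finisher
import com.johnsnowlabs.nlp.annotator._  // SentenceDetector, Tokenizer, Stemmer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.CountVectorizer
import spark.implicits._ // for toDF; assumes a SparkSession named `spark` (already in scope in spark-shell)

// Spark NLP stages: each one outputs a column of annotation structs.
val documentAssembler =
  new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
  new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
  new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
  .setInputCols("token")
  .setOutputCol("stem")

// The Finisher turns the "stem" annotations into a plain array<string> column.
val finisher = new Finisher()
  .setInputCols("stem")
  .setOutputCols("token_features")
  .setOutputAsArray(true)
  .setCleanAnnotations(false) // keep the intermediate annotation columns

// Spark ML stage: reads the finished array<string> column.
val cv = new CountVectorizer()
  .setInputCol("token_features")
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      stemmer,
      finisher,
      cv
    ))

val data =
  Seq("Peter Pipers employees are picking pecks of pickled peppers.")
    .toDF("text")

val model = pipeline.fit(data)
val df = model.transform(data)

// Show only the vectorized column, without truncation.
df.select("features").show(false)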

Output:

+--------------------------------------------------------------------+
|features                                                            |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+
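
To see why the Finisher resolves the type error from the question, it can help to compare the column types before and after it. The following is only a minimal sketch, assuming Spark and Spark NLP are on the classpath and that a local master is acceptable; `isStringArray`, `checkPipeline`, `sample`, and `checked` are hypothetical names introduced for illustration, and the sentence detector is left out to keep the example short.

```scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Assumption: local master for illustration only; getOrCreate reuses an existing session.
val spark = SparkSession.builder()
  .appName("finisher-schema-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: true when a column type is array<string>.
def isStringArray(dt: DataType): Boolean = dt match {
  case ArrayType(StringType, _) => true
  case _                        => false
}

val checkPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("text").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("token"),
  new Stemmer().setInputCols("token").setOutputCol("stem"),
  new Finisher()
    .setInputCols("stem")
    .setOutputCols("token_features")
    .setOutputAsArray(true)
    .setCleanAnnotations(false) // keep "stem" so both columns can be compared
))

val sample =
  Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val checked = checkPipeline.fit(sample).transform(sample)

// Print both column types side by side.
checked.select("stem", "token_features").printSchema()

val stemIsStringArray = isStringArray(checked.schema("stem").dataType)
val finishedIsStringArray = isStringArray(checked.schema("token_features").dataType)
println(s"stem is array<string>: $stemIsStringArray")
println(s"token_features is array<string>: $finishedIsStringArray")
```

The `stem` column should print as an array of annotation structs, matching the type named in the error message, while `token_features` should print as an array of strings that CountVectorizer can consume.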
