About an error regarding Spark NLP using Scala

Question

I am a beginner with Spark NLP and I am learning it through the examples from John Snow Labs. I am using Scala in Databricks.

When I follow the example below,

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().
    setInputCol("text").
    setOutputCol("document")

val regexTokenizer = new Tokenizer().
    setInputCols(Array("sentence")).
    setOutputCol("token")
val sentenceDetector = new SentenceDetector().
    setInputCols(Array("document")).
    setOutputCol("sentence")

val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true)


finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

When I run the last line, I get the following error:

command-786892578143744:2: error: value withColumn is not a member of com.johnsnowlabs.nlp.Finisher
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

What could be the reason for this?

When I tried the example again, leaving out that line, I added the following extra lines of code:

val pipeline = new Pipeline().
    setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ))

val data1 = Seq("hello, this is an example sentence").toDF("text")

pipeline.fit(data1).transform(data1).toDF("text")

I get another error when running the last line:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.

Can anyone help me solve this?

Thanks

Answer 1

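I think you have two problems:

1. First, you are trying to apply withColumn to an annotator; you should call it on a DataFrame instead.
2. I think the second error comes from the toDF() after the transform. The transformed DataFrame has more columns, and you are supplying only 1 column name. You may also not need that toDF() at all.

Alberto.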

Answer 2

Your code should look like this. First, build the pipeline:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().
    setInputCol("text").
    setOutputCol("document")

val regexTokenizer = new Tokenizer().
    setInputCols(Array("sentence")).
    setOutputCol("token")
val sentenceDetector = new SentenceDetector().
    setInputCols(Array("document")).
    setOutputCol("sentence")

val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true)

val pipeline = new Pipeline().
    setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ))

Create a simple DataFrame for testing:

val data1 = Seq("hello, this is an example sentence").toDF("text")

Now fit the pipeline and use it to transform your DataFrame:

val prediction = pipeline.fit(data1).transform(data1)

The variable prediction is a DataFrame, and you can explode the token column on it. Let's look at the prediction DataFrame:

scala> prediction.show
+--------------------+--------------------+-----------------------+
|                text|      finished_token|finished_token_metadata|
+--------------------+--------------------+-----------------------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|
+--------------------+--------------------+-----------------------+

scala> prediction.withColumn("newCol", explode($"finished_token")).show
+--------------------+--------------------+-----------------------+--------+
|                text|      finished_token|finished_token_metadata|  newCol|
+--------------------+--------------------+-----------------------+--------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|   hello|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|       ,|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|    this|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      is|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      an|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...| example|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|sentence|
+--------------------+--------------------+-----------------------+--------+

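The first problem, as Alberto mentioned, is treating finisher as a DataFrame. It is an annotator; you only get a DataFrame once the fitted pipeline has transformed your data.
The second problem is using .toDF() where it is not needed (after the pipeline transform).
Your explode call is also in the wrong place, and you are zipping a column that does not even exist in your pipeline: finished_ner. The pipeline has no NER stage, so the Finisher only produces finished_token.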
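To illustrate the second problem, here is a minimal sketch of why toDF("text") fails on a three-column DataFrame and what could be used instead. It assumes Spark is on the classpath; the stand-in DataFrame `transformed` and the names `renamed` and `onlyText` are hypothetical and only mimic the column layout of the pipeline output shown above.

```scala
import org.apache.spark.sql.SparkSession

// In Databricks or spark-shell a session already exists; getOrCreate reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("toDF-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the pipeline output: three columns, like
// text / finished_token / finished_token_metadata in the answer above.
val transformed = Seq(
  ("hello, this is", Seq("hello", ",", "this", "is"), Seq("0", "0", "0", "0"))
).toDF("text", "finished_token", "finished_token_metadata")

// transformed.toDF("text") would fail here: toDF(names*) renames ALL columns,
// so one name for three columns raises
// "requirement failed: The number of columns doesn't match."

// Option 1: rename every column by passing exactly as many names as columns.
val renamed = transformed.toDF("text", "tokens", "token_metadata")

// Option 2: if only one column is wanted, select it instead of renaming.
val onlyText = transformed.select("text")
```

In most cases neither option is needed, because the result of transform is already a DataFrame and can be used directly.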

Feel free to ask any questions and I will update the answer accordingly.
