How to use the JohnSnowLabs NLP spell correction module NorvigSweetingModel?
I am using the JohnSnowLabs SpellChecker.
I found the implementation of Norvig's algorithm there, and the example section has only the following two lines:
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()
How can I apply this pretrained model to the DataFrame below (df) to correct the spelling in the "names" column?
+----------------+---+------------+
| names|age| color|
+----------------+---+------------+
| [abc, cde]| 19| red, abc|
|[eefg, efa, efb]|192|efg, efz efz|
+----------------+---+------------+
I tried the following:
val schk = NorvigSweetingModel.pretrained().setInputCols("names").setOutputCol("Corrected")
val cdf = schk.transform(df)
But the code above gives me the following error:
java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SPELL_a1f11bacb851. Received inputCols: names. Make sure such columns have following annotator types: token
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:51)
... 49 elided
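spark-nlp is designed to be used in its own specific pipelines, and the input columns of its transformers must carry special metadata.
The exception already tells you that the input to NorvigSweetingModel should be tokenized: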
Make sure such columns have following annotator types: token
If I am not mistaken, at a minimum you will have to assemble documents and tokenize them first.
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")
val nlpPipeline = new Pipeline().setStages(Array(
new DocumentAssembler().setInputCol("names").setOutputCol("document"),
new Tokenizer().setInputCols("document").setOutputCol("tokens"),
NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected")
))
A Pipeline like this can be applied to your data with a small adjustment: the input data has to be string, not array<string>*:
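import org.apache.spark.sql.functions.concat_ws
// Join each array of names into one space-separated string, then fit and apply the pipeline.
val result = df
.transform(_.withColumn("names", concat_ws(" ", $"names")))
.transform(df => nlpPipeline.fit(df).transform(df))
result.show()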
+------------+--------------------+--------------------+--------------------+
| names| document| tokens| corrected|
+------------+--------------------+--------------------+--------------------+
| abc cde|[[document, 0, 6,...|[[token, 0, 2, ab...|[[token, 0, 2, ab...|
|eefg efa efb|[[document, 0, 11...|[[token, 0, 3, ee...|[[token, 0, 3, ee...|
+------------+--------------------+--------------------+--------------------+
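The "special metadata" mentioned earlier can be inspected on any DataFrame through the standard Spark schema API. The sketch below is only an illustration: columnMetadata is a hypothetical helper name, and it makes no assumption about which keys spark-nlp actually writes; it simply returns whatever metadata each column carries.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of spark-nlp): pairs each column name
// with the JSON rendering of its schema metadata.
def columnMetadata(df: DataFrame): Seq[(String, String)] =
  df.schema.fields.toSeq.map(field => field.name -> field.metadata.json)
```

Calling columnMetadata(result).foreach(println) on the pipeline output lets you compare the original names column with the annotated document, tokens and corrected columns.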
If you want output that can be exported, you should extend your Pipeline with a Finisher.
import com.johnsnowlabs.nlp.Finisher
new Finisher().setInputCols("corrected").transform(result).show
+------------+------------------+
| names|finished_corrected|
+------------+------------------+
| abc cde| [abc, cde]|
|eefg efa efb| [eefg, efa, efb]|
+------------+------------------+
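The Finisher above is applied to result as a separate step. As a minimal sketch, it can instead be appended as the last stage of the Pipeline, so that one fit and transform goes from the raw names column to the finished output. This reuses only the calls already shown in this answer; the name fullPipeline, the local SparkSession setup and the app name are assumptions made for illustration, and the pretrained model still has to be downloadable in your environment.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

// Assumption: a local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder()
  .appName("spell-check-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")

// Same stages as before, with the Finisher added as the final stage.
val fullPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))

// DocumentAssembler needs a string column, so join the array first.
val joined = df.withColumn("names", concat_ws(" ", $"names"))
val finished = fullPipeline.fit(joined).transform(joined)
finished.show()
```

Keeping the Finisher inside the Pipeline means the fitted model can be saved and reused as a single unit, at the cost of no longer exposing the intermediate annotation columns by default.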
* According to the docs, DocumentAssembler can read either a String column or an Array[String], but in practice that does not appear to work in 1.7.3:
df.transform(df => nlpPipeline.fit(df).transform(df)).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(names)' due to data type mismatch: argument 1 requires string type, however, '`names`' is of array<string> type.;;
'Project [names#62, UDF(names#62) AS document#343]
+- AnalysisBarrier
+- Project [value#60 AS names#62]
+- LocalRelation [value#60]