How should we use setDictionary with the Lemmatizer annotator in Spark-NLP?
I have a requirement where I must add a dictionary to the lemmatization step. When I try to use it in a pipeline and execute pipeline.fit(), I get an ArrayIndexOutOfBoundsException.
What is the correct way to implement this? Is there an example?
I am passing token as the input column to the Lemmatizer and lemma as the output column. Here is my code:
// imports for the annotators and the pipeline
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Lemmatizer, Tokenizer}
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}
import org.apache.spark.ml.Pipeline

// DocumentAssembler annotator
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// SentenceDetector annotator
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Tokenizer annotator
val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Lemmatizer annotator with an external dictionary
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary(ExternalResource("C:/data/notebook/lemmas001.txt", ReadAs.LINE_BY_LINE, Map("keyDelimiter" -> ",", "valueDelimiter" -> "|")))

val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
val result = pipeline.fit(df).transform(df)
The error message is:
Name: java.lang.ArrayIndexOutOfBoundsException
Message: 1
StackTrace: at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$$anonfun$apply.apply(ResourceHelper.scala:315)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$$anonfun$apply.apply(ResourceHelper.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys.apply(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys.apply(ResourceHelper.scala:312)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$.flattenRevertValuesAsKeys(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:52)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:19)
at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:45)
at org.apache.spark.ml.Pipeline$$anonfun$fit.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
I think your pipeline is fine, so everything comes down to what is inside lemmas001.txt and whether you can actually access it on Windows.
NOTE: I have seen users on Windows use paths like this in Apache Spark:
"C:\Users\something\Desktop\someDirectory\somefile.txt"
How to train a Lemmatizer in Spark NLP
It's really simple:
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")
The file has to be in the following format, where the keyDelimiter in this case is -> and the valueDelimiter is \t:
abnormal -> abnormal abnormals
abode -> abode abodes
abolish -> abolishing abolished abolish abolishes
abolitionist -> abolitionist abolitionists
abominate -> abominate abominated abominates
abomination -> abomination abominations
aboriginal -> aboriginal aboriginals
aborigine -> aborigines aborigine
abort -> aborted abort aborts aborting
abortifacient -> abortifacients abortifacient
abortionist -> abortionist abortionists
abortion -> abortion abortions
abo -> abo abos
abotrite -> abotrites abotrite
abound -> abound abounds abounding abounded
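Putting it together, here is a minimal end-to-end sketch; it assumes a DataFrame df with a text column and that the AntBNC dictionary file sits on a path Spark can read (both the DataFrame and the file location are illustrative):

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Lemmatizer, Tokenizer}
import org.apache.spark.ml.Pipeline

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")

val pipeline = new Pipeline().setStages(Array(document, token, lemmatizer))

// fit() is the step that reads and flattens the dictionary;
// transform() then maps each token to its lemma
val result = pipeline.fit(df).transform(df)
result.select("lemma.result").show(false)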
Also, if you don't want to train your own Lemmatizer, you can use a pretrained model like one of these:
English
val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
French
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "fr")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
Italian
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "it")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
German
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "de")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
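A pretrained LemmatizerModel drops into the pipeline in place of the trainable Lemmatizer, and no fit-time dictionary is needed because the lemma table ships with the model. A short usage sketch, reusing the document, sentenceDetector, and token stages from the question (the sample sentence is just illustrative):

import com.johnsnowlabs.nlp.annotators.LemmatizerModel

val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")

val pipeline = new Pipeline()
  .setStages(Array(document, sentenceDetector, token, lemmatizer))

import spark.implicits._
val sample = Seq("We abolished several abominations").toDF("text")
pipeline.fit(sample).transform(sample).select("lemma.result").show(false)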
The list of all pretrained models is here: https://nlp.johnsnowlabs.com/docs/en/models
The list of all pretrained pipelines is here: https://nlp.johnsnowlabs.com/docs/en/pipelines
If you have any more questions, please let me know in the comments.
Full disclosure: I'm one of the contributors to the Spark NLP library.
UPDATE: In case you are interested, I found this example in Scala on Databricks for you (this is actually how they trained the Italian Lemmatizer model).
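For reference, the recipe there is essentially the one above: fit a pipeline whose Lemmatizer points at an Italian lemma dictionary, then persist the trained stage so it can be reloaded later as a LemmatizerModel. A hedged sketch of that flow (the dictionary file name, delimiters, training DataFrame, and output path are all assumptions, not the notebook's actual values):

import com.johnsnowlabs.nlp.annotators.{Lemmatizer, LemmatizerModel}
import org.apache.spark.ml.Pipeline

val italianLemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("lemma_italian.txt", "->", "\t") // assumed file and delimiters

val model = new Pipeline()
  .setStages(Array(document, sentenceDetector, token, italianLemmatizer))
  .fit(italianDf) // assumed DataFrame with an Italian "text" column

// persist only the trained lemmatizer stage for later reuse
model.stages.last
  .asInstanceOf[LemmatizerModel]
  .write.overwrite().save("/models/lemma_it")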