在管道中混合 Smark MLLIB 和 SparkNLP
Mix Smark MLLIB and SparkNLP in pipeline
在 MLLIB 管道中,如何在 Stemmer(来自 Spark NLP)之后链接 CountVectorizer(来自 SparkML)?
当我尝试在管道中同时使用两者时,我得到:
myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
此致,
您需要在 Spark NLP 管道中添加一个 Finisher。试试看:
val documentAssembler =
new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val finisher = new Finisher()
.setInputCols("stem")
.setOutputCols("token_features")
.setOutputAsArray(true)
.setCleanAnnotations(false)
val cv = new CountVectorizer()
.setInputCol("token_features")
.setOutputCol("features")
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
sentenceDetector,
tokenizer,
stemmer,
finisher,
cv
))
val data =
Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val model = pipeline.fit(data)
val df = model.transform(data)
输出:
+--------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+
在 MLLIB 管道中,如何在 Stemmer(来自 Spark NLP)之后链接 CountVectorizer(来自 SparkML)?
当我尝试在管道中同时使用两者时,我得到:
myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
此致,
您需要在 Spark NLP 管道中添加一个 Finisher。试试看:
val documentAssembler =
new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val finisher = new Finisher()
.setInputCols("stem")
.setOutputCols("token_features")
.setOutputAsArray(true)
.setCleanAnnotations(false)
val cv = new CountVectorizer()
.setInputCol("token_features")
.setOutputCol("features")
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
sentenceDetector,
tokenizer,
stemmer,
finisher,
cv
))
val data =
Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val model = pipeline.fit(data)
val df = model.transform(data)
输出:
+--------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+