Simplest method for text lemmatization in Scala and Spark
I want to apply lemmatization to a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
The expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anyone help me? Does anyone know the simplest way to do lemmatization in Scala and Spark?
There is a function for this in the book Advanced Analytics with Spark, in the chapter on lemmatization:
import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // Build a CoreNLP pipeline that tokenizes, splits sentences, tags POS, and lemmatizes.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // Keep only lemmas longer than two characters that are not stop words.
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just apply it to every line inside a mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
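If your input lives in a text file rather than an in-memory list, you can load it line by line with sc.textFile and run the same function over every line. A minimal sketch, assuming a placeholder path for your corpus:
// Placeholder path; point this at your actual text file (local or HDFS).
val plainText = sc.textFile("hdfs:///data/reviews.txt")

// One Seq[String] of lemmas per input line.
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))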
Edit:
I added the following line to the code:
import scala.collection.JavaConversions._
It is required, otherwise sentences is a Java list and not a Scala one. The code should now compile without problems.
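As an alternative to the implicit JavaConversions import, the same loop can be written with the explicit JavaConverters API and .asScala; this is just an equivalent sketch, not what the book uses:
import scala.collection.JavaConverters._

// .asScala converts the Java lists returned by CoreNLP into Scala collections explicitly.
val sentences = doc.get(classOf[SentencesAnnotation]).asScala
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  val lemma = token.get(classOf[LemmaAnnotation])
  // same filtering as in the function above
}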
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
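If you build with sbt instead of Maven, the equivalent dependency declarations (same version, just a sketch) would be:
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)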
You can also take a look at the stanford.nlp page, which has many examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml
Edit:
mapPartitions version:
Although I do not know whether it will speed the job up significantly.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  // The pipeline is created once per partition instead of once per record.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
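If you want to persist the result instead of printing it, you could, for example, join each document's lemmas back into a single line and write the RDD out; the output path below is only a placeholder:
lemmatized
  .map(_.mkString(" "))                               // one space-separated line of lemmas per document
  .saveAsTextFile("hdfs:///data/reviews-lemmatized")  // placeholder output directory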
I think @user52045 has the right idea. The only modification I would make is to use mapPartitions instead of map, which lets you create the potentially expensive pipeline only once per partition. This may not be a big hit for a lemmatization pipeline, but it matters a great deal if you want to do something that requires a model, such as the NER part of the pipeline.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
I would suggest using the Stanford CoreNLP wrapper for Apache Spark, since it provides an official API for the basic CoreNLP functions such as lemmatization, tokenization, and so on.
I have used the same lemmatization on Spark DataFrames.
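A minimal sketch of what that could look like, assuming the databricks/spark-corenlp wrapper; the package and function names below come from that wrapper and may differ depending on the version you use:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._  // from the spark-corenlp wrapper
import spark.implicits._                         // assuming an existing SparkSession named spark

val reviews = Seq(
  (1, "surprise heard thump opened door small seedy man clasping package wrapped.")
).toDF("id", "text")

// lemma() returns an array of lemmas for each input string.
val lemmatized = reviews.select(col("id"), lemma(col("text")).as("lemmas"))
lemmatized.show(truncate = false)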