Simplest method for text lemmatization in Scala and Spark
I want to apply lemmatization to a text file:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
cables cables finally able hear gem long rumored music .
...
The expected output is:
surprise heard thump open door small seed man clasp package wrap.
upgrade system found review spring 2008 issue mood audio back.
omg left gotta wrap review order asap . understand hand deliver dali lama
speak hand wear earplug live . listen maintain link long .
cable cable final able hear gem long rumor music .
...
Can anyone help me? Does anyone know the simplest way to do lemmatization in Scala and Spark?
There is a function for this in the book Advanced Analytics with Spark, in the chapter on lemmatization:
import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // Build a CoreNLP pipeline that tokenizes, splits sentences, tags POS, and lemmatizes.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // Keep only lemmas longer than two characters that are not stop words.
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
Now just apply it to every line inside a mapper:
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
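If your input lives in a text file rather than an in-memory list, you can load it line by line with sc.textFile and run the same function over every line. A minimal sketch, assuming a placeholder path for your corpus:
// Placeholder path; point this at your actual text file (local or HDFS).
val plainText = sc.textFile("hdfs:///data/reviews.txt")

// One Seq[String] of lemmas per input line.
val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))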
Edit:
I added the following line to the code:
import scala.collection.JavaConversions._
It is required, otherwise sentences is a Java list and not a Scala one. The code should now compile without problems.
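As an alternative to the implicit JavaConversions import, the same loop can be written with the explicit JavaConverters API and .asScala; this is just an equivalent sketch, not what the book uses:
import scala.collection.JavaConverters._

// .asScala converts the Java lists returned by CoreNLP into Scala collections explicitly.
val sentences = doc.get(classOf[SentencesAnnotation]).asScala
for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  val lemma = token.get(classOf[LemmaAnnotation])
  // same filtering as in the function above
}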
I used Scala 2.10.4 and the following stanford.nlp dependencies:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
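If you build with sbt instead of Maven, the equivalent dependency declarations (same version, just a sketch) would be:
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)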
You can also take a look at the stanford.nlp page, which has many examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml
Edit:
mapPartitions version:
Although I do not know whether it will speed the job up significantly.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  // The pipeline is created once per partition instead of once per record.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
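If you want to persist the result instead of printing it, you could, for example, join each document's lemmas back into a single line and write the RDD out; the output path below is only a placeholder:
lemmatized
  .map(_.mkString(" "))                               // one space-separated line of lemmas per document
  .saveAsTextFile("hdfs:///data/reviews-lemmatized")  // placeholder output directory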
I think @user52045 has the right idea. The only modification I would make is to use mapPartitions instead of map, which lets you create the potentially expensive pipeline only once per partition. This may not be a big hit for a lemmatization pipeline, but it matters a great deal if you want to do something that requires a model, such as the NER part of the pipeline.
def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
I would suggest using the Stanford CoreNLP wrapper for Apache Spark, since it provides an official API for the basic CoreNLP functions such as lemmatization, tokenization, and so on.
I have used the same lemmatization on Spark DataFrames.
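A minimal sketch of what that could look like, assuming the databricks/spark-corenlp wrapper; the package and function names below come from that wrapper and may differ depending on the version you use:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._  // from the spark-corenlp wrapper
import spark.implicits._                         // assuming an existing SparkSession named spark

val reviews = Seq(
  (1, "surprise heard thump opened door small seedy man clasping package wrapped.")
).toDF("id", "text")

// lemma() returns an array of lemmas for each input string.
val lemmatized = reviews.select(col("id"), lemma(col("text")).as("lemmas"))
lemmatized.show(truncate = false)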