NGram on dataset with one word

Question

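I am experimenting with SparkML and trying to build fuzzy matching using Spark's OOB (out-of-the-box) features. As part of this, I am building NGrams with n=2. However, some rows in my dataset contain a single word, and the Spark pipeline fails on them. Spark aside, I would like to know the general approach to this problem. I.e., what if the tokens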

Answer 1

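A Scala approach. Normally it should also work with 1 word and not fail or crash. Without MLlib, but using sliding, you get a "bigram" made of 1 word, which is of course debatable. This version also parses by sentence. Like this: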

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend.",
                               "singleword"))
rdd.map{ 
    // Split each line into substrings by periods
    _.split('.').map{ substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').map{_.replaceAll("""\W""", "").toLowerCase()}.
        // Find bigrams, etc.
        sliding(2)
     }.
    // Flatten, and map the ngrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.
    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}
}.
// Reduce to get a global count, then collect.  
flatMap{identity}.reduceByKey(_+_).collect.
// Print
foreach{x=> println(x._1 + ", " + x._2)}

This does not fail on "singleword", but it gives you a single word instead of a bigram:

you today, 1
hello my, 1
singleword, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1
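
The lone "singleword" entry comes from how sliding behaves on Scala collections: when fewer than 2 elements are available, it still emits one shorter window. A minimal sketch of that behavior on plain arrays (no Spark needed; the variable names are only illustrative), along with one possible way to keep only full-size windows:

```scala
val single = Array("singleword")
val pair   = Array("two", "words")

// A window is emitted even when fewer than 2 tokens are available.
val singleWindows = single.sliding(2).map(_.mkString(" ")).toList
val pairWindows   = pair.sliding(2).map(_.mkString(" ")).toList

// Keeping only full-size windows drops the lone word.
val strictWindows = single.sliding(2).filter(_.length == 2).map(_.mkString(" ")).toList

println(singleWindows) // List(singleword)
println(pairWindows)   // List(two words)
println(strictWindows) // List()
```

Whether the single word should be kept or dropped depends on what the downstream matching expects.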

Using mllib and going through the lines of this input:

the quick brown fox.
singleword.
two words.

with this code:

import org.apache.spark.mllib.rdd.RDDFunctions._
val wordsRdd = sc.textFile("/FileStore/tables/sliding.txt", 1)
// Tokenize on spaces, lowercase, normalize punctuation, then slide over the whole RDD.
// Note: regex backslashes must be doubled inside ordinary Scala string literals.
val wordsRDDTextSplit = wordsRdd.map(line => line.trim.split(" ")).flatMap(x => x).map(x => x.toLowerCase())
  .map(x => x.replaceAll(",{1,}", "")).map(x => x.replaceAll("!{1,}", "."))
  .map(x => x.replaceAll("\\?{1,}", ".")).map(x => x.replaceAll("\\.{1,}", "."))
  .map(x => x.replaceAll("\\W+", ".")).filter(_ != ".").filter(_ != "")
  .map(x => x.replace(".", "")).sliding(2).collect

you get:

 wordsRDDTextSplit: Array[Array[String]] = Array(Array(the, quick), Array(quick, brown), Array(brown, fox), Array(fox, singleword), Array(singleword, two), Array(two, words))

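Note that I parsed the lines differently here: the words of all lines are flattened into one stream before sliding, so bigrams such as (fox, singleword) cross line boundaries.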
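
To make the difference between the two parsing styles concrete, here is a small sketch on plain Scala collections rather than RDDs (so it only illustrates the boundary behavior, not the RDD sliding itself). The names perLine and acrossLines are made up for this example:

```scala
val lines = List("the quick brown fox", "singleword", "two words")

// Slide within each line: no bigram crosses a line boundary.
val perLine = lines.flatMap(_.split(" ").sliding(2).map(_.mkString(" ")))

// Flatten all tokens first, then slide: bigrams span line boundaries.
val acrossLines = lines.flatMap(_.split(" ")).sliding(2).map(_.mkString(" ")).toList

println(perLine)
// List(the quick, quick brown, brown fox, singleword, two words)
println(acrossLines)
// List(the quick, quick brown, brown fox, fox singleword, singleword two, two words)
```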

When I run the mllib version above on a file containing only one line with 1 word, I get empty output:

wordsRDDTextSplit: Array[Array[String]] = Array()

So, as you can see, you can choose to handle the single-word case or not, depending on the approach.
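
One possible way to make that choice explicit is a small helper that decides what happens when a row has fewer than n tokens. This is only a sketch: ngrams and its keepShort flag are hypothetical names, not part of Spark or any library.

```scala
// Hypothetical helper (not a Spark API): build n-grams for one row and
// decide explicitly what to do when the row has fewer than n tokens.
def ngrams(tokens: List[String], n: Int, keepShort: Boolean): List[String] = {
  require(n >= 1, "n must be at least 1")
  if (tokens.isEmpty) Nil
  else if (tokens.length < n) {
    // Short row: either keep it as a single shorter gram, or emit nothing.
    if (keepShort) List(tokens.mkString(" ")) else Nil
  } else tokens.sliding(n).map(_.mkString(" ")).toList
}

val rows = List("the quick brown fox", "singleword", "two words")

val kept    = rows.map(r => ngrams(r.split(" ").toList, 2, keepShort = true))
val dropped = rows.map(r => ngrams(r.split(" ").toList, 2, keepShort = false))

println(kept(1))    // List(singleword)
println(dropped(1)) // List()
```

With keepShort = true the result mirrors the first snippet (the single word survives as its own gram); with keepShort = false it mirrors the mllib result (nothing is produced for that row).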
