如何从频率计数在 Spark/Scala 中的文本文件创建二元组？

Question

我想获取一个文本文件并创建一个所有单词的二元组，其中没有用点“.”分隔，删除任何特殊字符。我正在尝试使用 Spark 和 Scala 来做到这一点。

这段文字：

你好我的朋友。怎么样
你今天？再见我的朋友

应该产生以下内容：

你好，1
我的朋友，2
怎么样，1
你今天，1
今天再见，1
再见，1

Answer 1

对于RDD中的每一行，首先根据'.'进行拆分。然后通过在 ' ' 上拆分来标记每个生成的子字符串。标记化后，使用 replaceAll 删除特殊字符并转换为小写。这些子列表中的每一个都可以使用 sliding 转换为包含双字母组的字符串数组的迭代器。

然后，在按要求使用 mkString 将二元数组展平并转换为字符串后，使用 groupBy 和 mapValues.

对每个数组进行计数

最后对 RDD 中的 (bigram, count) 元组进行扁平化、缩减和收集。

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map{ 

    // Split each line into substrings by periods
    _.split('.').map{ substrings =>

        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

        // Remove non-alphanumeric characters, using Shyamendra's
        // clean replacement technique, and convert to lowercase
        map{_.replaceAll("""\W""", "").toLowerCase()}.

        // Find bigrams
        sliding(2)
    }.

    // Flatten, and map the bigrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.

    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}

}.

// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.

// Format and print
foreach{x=> println(x._1 + ", " + x._2)}

you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1

Answer 2

为了将整个单词与任何标点符号分开考虑例如

val words = text.split("\W+")

在这种情况下交付

Array[String] = Array(Hello, my, Friend, How, are, you, today, bye, my, friend)

将单词配对成元组证明更符合二元组的概念，因此考虑例如

for( Array(a,b,_*) <- words.sliding(2).toArray ) 
yield (a.toLowerCase(), b.toLowerCase())

产生

Array((hello,my), (my,friend), (friend,How), (how,are), 
      (are,you), (you,today), (today,bye), (bye,my), (my,friend))

ohruunuruus 的回答传达了一种简洁的方法。

Answer 3

这应该适用于 Spark：

def bigramsInString(s: String): Array[((String, String), Int)] = { 

    s.split("""\.""")                        // split on .
     .map(_.split(" ")                       // split on space
           .filter(_.nonEmpty)               // remove empty string
           .map(_.replaceAll("""\W""", "")   // remove special chars
                 .toLowerCase)
           .filter(_.nonEmpty)                
           .sliding(2)                       // take continuous pairs
           .filter(_.size == 2)              // sliding can return partial
           .map{ case Array(a, b) => ((a, b), 1) })
     .flatMap(x => x)                         
}

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map(bigramsInString)
   .flatMap(x => x)             
   .countByKey                   // get result in driver memory as Map
   .foreach{ case ((x, y), z) => println(s"${x} ${y}, ${z}") }

// my friend, 2
// how are, 1
// today bye, 1
// bye my, 1
// you today, 1
// hello my, 1

如何从频率计数在 Spark/Scala 中的文本文件创建二元组？

How to create a bigram from a text file with frequency count in Spark/Scala?

scala

n-gram

apache-spark