How to extract bigrams and trigrams in Scala?


very pleased product . phone lightweight comfortable sound quality good house yard . 
quality construction phone base unit good . ample supply cable adapter . plug computer soundcard .
shop unit mail rebate . unit battery pack hold play time strap carr headphone adapter cable perfect digital copy optical. component micro plug stereo connector cable micro plug rca cable . 
unit primarily record guitar jam session . input plug provide power plug microphone . decent stereo mic need digital recording performance . mono mode double recording time .
admit like new electronic toy . digital camera not impress .



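IIUC, your function takes the whole document above and splits it on ".". That is your first problem: calling split(".") doesn't do what you think it does. split takes a regular expression, so you are actually splitting on the wildcard that matches any character rather than on a literal "." as you intended. Change it to split("\\.") and the document will be split into sentences.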
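To make the difference concrete, here is a small sketch. The object name SplitDemo and the shortened sample string are made up for illustration; only String.split itself comes from the discussion above.

```scala
object SplitDemo {
  // Shortened sample in the same shape as the document above (illustrative only).
  val doc = "very pleased product . phone lightweight . sound quality good ."

  // "." is a regex wildcard, so every character is treated as a delimiter.
  // Only empty strings are left, and split drops trailing empty strings,
  // so the resulting array has no elements at all.
  val wrong: Array[String] = doc.split(".")

  // Escaping the dot matches a literal period; trim removes the padding spaces.
  val sentences: Array[String] = doc.split("\\.").map(_.trim)
}
```

With the unescaped pattern you get an empty array back, which is why nothing downstream appears to work.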

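Once that's done, we need to split each sentence into words by splitting on whitespace. I suggest _.split("\\s+"), which splits on any run of whitespace. You should then be able to walk over the words and build trigrams with a function like this: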

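// Assumes `trigram` is the three-field case class from your question.
def stringToTrigrams(s: String) = {
  // Escape the dot so we split on literal periods, not the regex wildcard
  val sentences = s.split("\\.")
  sentences flatMap { sent =>
    // Split on runs of whitespace and drop empty tokens (e.g. from leading spaces)
    val words = sent.split("\\s+").filter(_ != "")
    if (words.length >= 3)
      words.sliding(3).map(a => trigram(a(0), a(1), a(2)))
    else Iterator[trigram]() // sentence too short to contain a trigram
  }
}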
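Since the question also asks about bigrams, the same idea can be generalized to any n. This is only a sketch under my own assumptions: the object NGrams and its methods sentences and ngrams are hypothetical names, and n-grams are returned as plain lists of words rather than your trigram type.

```scala
object NGrams {
  // Split a document into sentences on literal periods, then each sentence
  // into words on runs of whitespace; empty tokens and empty sentences are dropped.
  def sentences(doc: String): List[List[String]] =
    doc.split("\\.").toList
      .map(_.trim.split("\\s+").filter(_.nonEmpty).toList)
      .filter(_.nonEmpty)

  // All n-grams inside each sentence. Windows never cross a sentence boundary,
  // and sentences with fewer than n words contribute nothing.
  def ngrams(doc: String, n: Int): List[List[String]] =
    sentences(doc).flatMap { words =>
      if (words.length >= n) words.sliding(n).toList else Nil
    }
}
```

NGrams.ngrams(doc, 2) would then give the bigrams and NGrams.ngrams(doc, 3) the trigrams. The length check matters because sliding(n) on a shorter sequence still returns one partial window.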
