StopWords remover in Scala Spark

I'm having a problem in Scala. I need to remove stop words from a text file loaded as an RDD[String].

val sc = new SparkContext(conf)

val tweetsPath = args(0)
val outputDataset = args(1)

val tweetsRaw: RDD[String] = sc.textFile(tweetsPath)

val stopWords = Array("a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your","ain't","aren't","can't","could've","couldn't","didn't","doesn't","don't","hasn't","he'd","he'll","he's","how'd","how'll","how's","i'd","i'll","i'm","i've","isn't","it's","might've","mightn't","must've","mustn't","shan't","she'd","she'll","she's","should've","shouldn't","that'll","that's","there's","they'd","they'll","they're","they've","wasn't","we'd","we'll","we're","weren't","what'd","what's","when'd","when'll","when's","where'd","where'll","where's","who'd","who'll","who's","why'd","why'll","why's","won't","would've","wouldn't","you'd","you'll","you're","you've")

val cleanTxt = tweetsRaw.
  filter(x => x.startsWith("San Francisco") || x.startsWith("Chicago") || !stopWords.contains(x));

cleanTxt.saveAsTextFile(outputDataset)

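I tried this, but it doesn't work. I have to keep the same structure (use SparkConf rather than moving to SparkSession). How can I select all the tweets starting with "Chicago" and "San Francisco", remove the stop words from the text, and output the whole tweets line by line without those stop words?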

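I applied a flatMap to tweetsRaw, but with a flatMap the output is only the individual words that are not stop words. What I need is each whole line with the stop words removed, not just the words.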

I hope I've made clear what I want, and I hope you can help me solve this!

Thanks, everyone.

P.S. I tried many things with the StopWordsRemover from the Scala library, but I couldn't figure out how to make it work with SparkConf, without initializing a SparkSession.


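The following line in your Spark script only filters the tweets; it does not remove stop words from within a line. Note also that x is a whole line here, so !stopWords.contains(x) is true for almost every tweet, and the filter lets nearly everything through.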

val cleanTxt = tweetsRaw.
  filter(x => x.startsWith("San Francisco") || x.startsWith("Chicago") || !stopWords.contains(x));

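If you want to remove the stop words, you have to use a map transformation that strips the stop words from each line. You can then save the result to a file.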

Assuming each line is one tweet with words separated by spaces, this is how I would remove the stop words.

cleanTxt.map(tweet => tweet.split(" ").filterNot(x => stopWords.contains(x)).mkString(" ")).saveAsTextFile(outputDataset)
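
To make the per-line logic easy to check without a cluster, here is a minimal sketch that pulls the two steps into plain Scala functions. The names StopWordCleaner, keepCity and removeStopWords are hypothetical, and the stop-word list is shortened for illustration. It rests on two assumptions that are not in the original code: the filter keeps only the two startsWith checks, and words are lower-cased before the lookup so that "The" is dropped as well as "the".

```scala
// Sketch only: object and method names are hypothetical, not part of Spark.
object StopWordCleaner {
  // Shortened list for illustration; a Set gives constant-time lookups.
  val stopWords: Set[String] = Set("a", "the", "is", "in", "of", "and", "i'm")

  // Keep only tweets that start with one of the two city names.
  def keepCity(tweet: String): Boolean =
    tweet.startsWith("San Francisco") || tweet.startsWith("Chicago")

  // Drop stop words from a single line and rebuild the line.
  // Assumption: the comparison is case-insensitive (word is lower-cased first).
  def removeStopWords(tweet: String): String =
    tweet.split(" ").filterNot(w => stopWords.contains(w.toLowerCase)).mkString(" ")
}

// With the RDD from the question, the functions would plug in like this:
//   tweetsRaw
//     .filter(StopWordCleaner.keepCity)
//     .map(StopWordCleaner.removeStopWords)
//     .saveAsTextFile(outputDataset)
```

Because map returns one output line per input line, each tweet stays on its own line, which is the difference from the flatMap attempt. If the case-insensitive match is not wanted, dropping .toLowerCase restores the exact-match behaviour of the one-liner above.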