BiGrams in Spark using Java

I already have the sentences in an RDD, and the output looks like this:

RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It's in the rules. #Eurovision2018 RT @Mystificus: Of course I'll watch #eurovision tonight. After all, 200 million people can't be wrong, can they? Er...... RT @KlNGNEUER: Me when Europeans make fun of Eurovision VS Me when Americans make fun of Eurovision

#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching Eurovision… @AndrewDawes71 @SuzanneEvans1 @ConstantinStHe1 The tweet was directed at citizens of other countries partaking in t… Looking forward to @Eurovision @bbceurovision tonight and rooting for @surieofficial who has strong competition. Sh… RT @Jem_Collins: Media and journalism friends, I need you to do something during #Eurovision this evening. And that something is to drink a… Getting ready for anime AND Eurovision with friends tonight!

But when I try to split it on "." and ",", this code only gives me an empty txt file:

JavaRDD<String> sentences= lines.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());

where lines is an RDD containing the content shown above.

After that, how do I build the bigrams?

Reproducible example:

SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
    
words.saveAsTextFile(outputDir);

The input file is a .txt containing any sentences, but you can try the strings given at the top.

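The solution for the split was to wrap the delimiters in a character class, i.e. to use the patterns "[.]" and "[ ]". String.split treats its argument as a regular expression, so a bare "." matches every character: every piece is empty, trailing empty strings are dropped, and the resulting array has no elements.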
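For the bigram part, here is a minimal sketch in plain Java (no Spark dependency) of the per-sentence logic, which could then be passed to flatMap. The class name BigramSketch, the helper names sentences and bigrams, and the choice to split on both "." and "," via "[.,]" are assumptions made for illustration and are not taken from the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BigramSketch {

    // Split text into sentences on a literal '.' or ','.
    // "[.,]" is a character class, so '.' is NOT read as "any character".
    // (Splitting on ',' as well is an assumption based on the question.)
    public static List<String> sentences(String text) {
        return Arrays.stream(text.split("[.,]"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Build bigrams (pairs of adjacent words) inside one sentence,
    // so that no bigram crosses a sentence boundary.
    public static List<String> bigrams(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Why the original code produced nothing:
        System.out.println("a.b".split(".").length);   // 0
        System.out.println("a.b".split("[.]").length); // 2

        String text = "If Britain wins then we rejoin. It's in the rules.";
        for (String sentence : sentences(text)) {
            System.out.println(bigrams(sentence));
        }
    }
}
```

With the RDD from the question, and assuming the same Spark Java API style the question already uses (flatMap lambdas returning an Iterator), this would presumably be wired in as inputFile.flatMap(line -> BigramSketch.sentences(line).iterator()) followed by sentences.flatMap(s -> BigramSketch.bigrams(s).iterator()). Note that textFile yields one element per line, so a sentence that spans two lines would be treated as two separate sentences.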