Bigrams in Spark using Java
I already have the sentences in an RDD, and the output looks like this:
RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the
EU. It's in the rules. #Eurovision2018 RT @Mystificus: Of course I'll
watch #eurovision tonight. After all, 200 million people can't be
wrong, can they? Er...... RT @KlNGNEUER: Me when Europeans make
fun of Eurovision VS Me when Americans make fun of Eurovision
#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching
Eurovision… @AndrewDawes71 @SuzanneEvans1
@ConstantinStHe1 The tweet was directed at citizens of other countries
partaking in t… Looking forward to @Eurovision
@bbceurovision tonight and rooting for @surieofficial who has strong
competition. Sh… RT @Jem_Collins: Media and
journalism friends, I need you to do something during #Eurovision this
evening. And that something is to drink a… Getting ready for anime AND
Eurovision with friends tonight!
But when I try to split it on "." and ",", this code gives me only an empty text file:
JavaRDD<String> sentences= lines.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
where lines is an RDD holding the content shown in the screenshot.
After that, how do I build the bigrams?
Reproducible example:
SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
words.saveAsTextFile(outputDir);
The input file is any .txt file with sentences in it, but you can try the string given at the beginning.
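The fix for the split is to pass the delimiter as a regex character class, "[.]" or "[ ]". String.split takes a regular expression, and an unescaped "." matches every character, so every token is empty; split then drops the trailing empty strings and returns an empty array.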
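For the bigram part of the question, here is a minimal plain-Java sketch of one way to do it: split the text into sentences on a literal "." or ",", then pair each word with the word that follows it. The class name BigramSketch and the helpers sentences and bigrams are hypothetical names chosen for illustration, not part of Spark, and the choice to trim and skip empty pieces is an assumption about what the output should look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class BigramSketch {

    // Split on a literal '.' or ',' by putting them in a character class,
    // then trim each piece and drop the empty ones (e.g. from "Er......").
    static List<String> sentences(String text) {
        return Arrays.stream(text.split("[.,]"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Pair each word with its successor; a sentence with fewer than
    // two words produces no bigrams.
    static List<String> bigrams(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "It's in the rules. After all, can they?";
        for (String s : sentences(text)) {
            System.out.println(s + " -> " + bigrams(s));
        }
    }
}
```

In the Spark job from the question, something like lines.flatMap(l -> BigramSketch.sentences(l).iterator()) followed by sentences.flatMap(s -> BigramSketch.bigrams(s).iterator()) should plug these helpers in, assuming the same Java API as the original code, where flatMap expects an Iterator. Note that this treats each input line separately, so bigrams never span a line break.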