How to split sentences into words inside map(case(key,value)=>...) in Scala Spark
val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text = sc.textFile("/home/tobbyj/HW1_INF553/shortTwitter.txt")
val twitter = text
  .map(_.toLowerCase)
  .map(_.replace("\t", ""))
  .map(_.replace("\"", ""))
  .map(_.replace("\n", ""))
  .map(_.replace(".", ""))
  .map(_.replaceAll("""[\p{C}]""", "")) // strip control characters; \p is an invalid escape in a plain string literal, so use a raw (triple-quoted) string or "\\p{C}"
  .map(_.split("text:")(1).split(",source:")(0))
  .zipWithIndex.map(_.swap)
Using the code above, I get the following result.
(0,a rose by any other name would smell as sweet)
(1,a rose is a rose is a rose)
(4,rt @nba2k: the battle of two young teams tough season but one will emerge victorious who will it be? lakers or 76ers? https:\/\/tco\/nukkjq\u2026)
(2,love is like a rose the joy of all the earth)
(5,i was going to bake a cake and listen to the football flour refund?)
(3,at christmas i no more desire a rose than wish a snow in may’s new-fangled mirth)
However, the result I want is for the 'key' to start from 1 and for the 'value' to be split into words, as shown below so you can see what I mean, though I'm not sure it will actually look like this.
(1,(a, rose, by, any, other, name, would, smell, as, sweet))
(2,(a, rose, is, a, rose, is, a, rose))
...
The code I tried is
.map { case (key, value) => (key + 1, value.split(" ")) }
but it gives me results like this:
(1,[Ljava.lang.String;@1dff58b)
(2,[Ljava.lang.String;@167179a3)
(3,[Ljava.lang.String;@73e8c7d7)
(4,[Ljava.lang.String;@7bffa418)
(5,[Ljava.lang.String;@2d385beb)
(6,[Ljava.lang.String;@4f1ab87e)
Any suggestions? After this step, I'm going to map them into (1, a), (1, rose), (1, by) ... (2, love), (2, rose), ....
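First, note that your map actually worked: [Ljava.lang.String;@... is just the default toString of a Java array, not an error. As a minimal sketch to print the contents readably (assuming the twitter RDD defined above):

.map { case (key, value) => (key + 1, value.split(" ")) }
.mapValues(_.mkString("(", ", ", ")"))  // render each Array[String] as (word, word, ...) instead of its default toString
.collect()
.foreach(println)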
You can import org.apache.spark.rdd.PairRDDFunctions (documented here) to work with key-value pairs more easily. With that in place, you can use the flatMapValues method to get what you want; here is a minimal working example (if you are in the Spark console, just copy from the line containing val tweets onward):
import org.apache.spark._
import org.apache.spark.rdd.PairRDDFunctions
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
val tweets = sc.parallelize(Seq(
  "this is my first tweet",
  "and this is my second",
  "ok this is getting boring"))
val results =
  tweets.
    zipWithIndex.
    map(_.swap).
    flatMapValues(_.split(" "))
results.collect.foreach(println)
This is the output of those few lines of code:
(0,this)
(0,is)
(0,my)
(0,first)
(0,tweet)
(1,and)
(1,this)
(1,is)
(1,my)
(1,second)
(2,ok)
(2,this)
(2,is)
(2,getting)
(2,boring)
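The keys above start at 0; since you want them to start at 1, a small tweak (a sketch, shifting the index into the key before flattening) would be:

val oneBased =
  tweets.
    zipWithIndex.
    map { case (tweet, idx) => (idx + 1, tweet) }.  // make the 1-based line number the key
    flatMapValues(_.split(" "))
oneBased.collect.foreach(println)  // (1,this), (1,is), ..., (3,boring)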
If you are interested in a small example showing how to analyze a live Twitter feed with Spark Streaming, you can find one here.