Convert Scala FP-growth RDD output to DataFrame
https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth
sample_fpgrowth.txt can be found here:
https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt
I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the results from the RDDs into DataFrames. Please explain in detail, using the example given in my question, for these two RDDs:
model.freqItemsets and
model.generateAssociationRules(minConfidence)
Once you have an RDD, there are many ways to create a DataFrame. One of them is the .toDF function, which requires sqlContext.implicits to be imported, as follows:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("udf testings")
  .master("local")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
After that, read the fpgrowth text file and convert it into an RDD:
import org.apache.spark.rdd.RDD

val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
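For reference, each line of sample_fpgrowth.txt is one space-separated transaction, so the split above turns a line into an array of items. A quick local check of that parsing step (no Spark needed; the sample line is taken from the file):

```scala
// One transaction per line, items separated by single spaces.
val line = "r z h k p"
val items: Array[String] = line.trim.split(' ')
// items: Array(r, z, h, k, p)
```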
I used the code from the Frequent Pattern Mining - RDD-based API guide, as referenced in the question:
import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)
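As a side note (my own arithmetic, not part of the original answer): sample_fpgrowth.txt contains 6 transactions, so with minSupport = 0.2 MLlib keeps itemsets that occur in at least ceil(0.2 * 6) = 2 of them, which matches the freq values in the output below:

```scala
// Support threshold check: with 6 transactions and minSupport = 0.2,
// an itemset must appear at least ceil(0.2 * 6) = 2 times to be kept.
val numTransactions = 6
val minCount = math.ceil(0.2 * numTransactions).toLong
// minCount == 2
```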
The next step is to call the .toDF function. The first DataFrame:
model.freqItemsets.map(itemset =>(itemset.items.mkString("[", ",", "]") , itemset.freq)).toDF("items", "freq").show(false)
This results in:
+---------+----+
|items |freq|
+---------+----+
|[z] |5 |
|[x] |4 |
|[x,z] |3 |
|[y] |3 |
|[y,x] |3 |
|[y,x,z] |3 |
|[y,z] |3 |
|[r] |3 |
|[r,x] |2 |
|[r,z] |2 |
|[s] |3 |
|[s,y] |2 |
|[s,y,x] |2 |
|[s,y,x,z]|2 |
|[s,y,z] |2 |
|[s,x] |3 |
|[s,x,z] |2 |
|[s,z] |2 |
|[t] |3 |
|[t,y] |3 |
+---------+----+
only showing top 20 rows
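Since there are several ways to build the DataFrame, here is a minimal sketch of one alternative: using sparkSession.createDataFrame with an explicit schema instead of .toDF (column names chosen to match the output above):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Map each frequent itemset to a Row, then attach an explicit schema.
val rows = model.freqItemsets.map(itemset =>
  Row(itemset.items.mkString("[", ",", "]"), itemset.freq))

val schema = StructType(Seq(
  StructField("items", StringType, nullable = false),
  StructField("freq", LongType, nullable = false)))

val freqDF = sparkSession.createDataFrame(rows, schema)
freqDF.show(false)
```

This is more verbose than .toDF but gives you full control over column names, types, and nullability.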
The second DataFrame:
val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
.map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
.toDF("antecedent", "consequent", "confidence").show(false)
This results in:
+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y] |[x] |1.0 |
|[t,s,y] |[z] |1.0 |
|[y,x,z] |[t] |1.0 |
|[y] |[x] |1.0 |
|[y] |[z] |1.0 |
|[y] |[t] |1.0 |
|[p] |[r] |1.0 |
|[p] |[z] |1.0 |
|[q,t,z] |[y] |1.0 |
|[q,t,z] |[x] |1.0 |
|[q,y] |[x] |1.0 |
|[q,y] |[z] |1.0 |
|[q,y] |[t] |1.0 |
|[t,s,x] |[y] |1.0 |
|[t,s,x] |[z] |1.0 |
|[q,t,y,z] |[x] |1.0 |
|[q,t,x,z] |[y] |1.0 |
|[q,x] |[y] |1.0 |
|[q,x] |[t] |1.0 |
|[q,x] |[z] |1.0 |
+----------+----------+----------+
only showing top 20 rows
Hope this is what you need.
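One final note, not in the original answer: if you are on Spark 2.2 or later, the DataFrame-based spark.ml API exposes FPGrowth directly, so the RDD-to-DataFrame conversion can be skipped entirely. A sketch, assuming the transactions RDD from above:

```scala
import org.apache.spark.ml.fpm.FPGrowth

// spark.ml's FPGrowth consumes and produces DataFrames directly.
val transactionsDF = transactions.toDF("items")

val fpgModel = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)
  .setMinConfidence(0.8)
  .fit(transactionsDF)

fpgModel.freqItemsets.show(false)      // columns: items, freq
fpgModel.associationRules.show(false)  // columns: antecedent, consequent, confidence
```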