PageRank using GraphX
I have a .txt file, list.txt, that contains a list of source and destination URLs in the format
google.de/2011/10/Extract-host link.de/2011/10/extact-host
facebook.de/2014/11/photos facebook.de/2014/11/name.jpg
community.cloudera.com/t5/ community.cloudera.com/t10/
facebook.de/2014/11/photos link.de/2011/10/extact-host
With the help of this post,
I tried to create the nodes and edges like this:
val test = sc.textFile("list.txt") // works
val arrayForm = test.map(_.split("\t")) // works
val nodes: RDD[(VertexId, Option[String])] =
  arrayForm.flatMap(array => array).map(url => (url.toLong, None)) // fails: the URLs are not numeric
val edges: RDD[Edge[String]] =
  arrayForm.map(line => Edge(line(0), line(1), "")) // fails: Edge expects Long vertex IDs, not Strings
The problem is that I don't know how to create a VertexId (and, similarly, the edges) from the String data type. Please let me know how to solve this.
The answer is hashing. Since your vertex IDs are strings, you can hash them with MurmurHash3,
build the graph, do what you need to do, and then join the hash values back to the original strings.
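A quick check of why this works: `MurmurHash3.stringHash` is deterministic, so the same URL always maps to the same `Int`, which is what lets the later join recover the original strings. (Note that, being a 32-bit hash, distinct strings can in principle collide.) A minimal standalone sketch:

```scala
import scala.util.hashing.MurmurHash3

object HashCheck {
  def main(args: Array[String]): Unit = {
    val url = "facebook.de/2014/11/photos"
    // The same input always produces the same hash value.
    assert(MurmurHash3.stringHash(url) == MurmurHash3.stringHash(url))
    // Different inputs almost always produce different hashes.
    println(MurmurHash3.stringHash(url) != MurmurHash3.stringHash(url + "x"))
  }
}
```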
Example code:
package com.void

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Graph, VertexId}
import scala.util.hashing.MurmurHash3

object Main {

  def main(args: Array[String]): Unit = {

    val conf =
      new SparkConf()
        .setAppName("SO Spark")
        .setMaster("local[*]")
        .set("spark.driver.host", "localhost")

    val sc = new SparkContext(conf)

    val file = sc.textFile("data/pr_data.txt")

    // Hash both URLs on each line; MurmurHash3 returns an Int,
    // which widens to the Long-based VertexId.
    val edgesRDD: RDD[(VertexId, VertexId)] =
      file
        .map(line => line.split("\t"))
        .map(line => (
          MurmurHash3.stringHash(line(0)),
          MurmurHash3.stringHash(line(1))
        ))

    val graph = Graph.fromEdgeTuples(edgesRDD, 1)

    // graph.triplets.collect.foreach( println )
    // println( "####" )

    // Run PageRank until convergence within the given tolerance.
    val ranks =
      graph
        .pageRank(0.0001)
        .vertices

    ranks.foreach(println)
    println("####")

    // Map each distinct URL to its hash so the ranks can be joined
    // back to human-readable names.
    val identificationMap =
      file
        .flatMap(line => line.split("\t"))
        .distinct
        .map(url => (MurmurHash3.stringHash(url).toLong, url))

    identificationMap.foreach(println)
    println("####")

    // (hash, (rank, url))
    val fullMap = ranks.join(identificationMap)

    fullMap.foreach(println)

    sc.stop()
  }
}
Result:
(-1578471469,1.2982456140350878)
(1547760250,0.7017543859649124)
(1657711982,1.0000000000000002)
(1797439709,0.7017543859649124)
(996122257,0.7017543859649124)
(-1127017098,1.5964912280701753)
####
(1547760250,community.cloudera.com/t5/)
(-1127017098,link.de/2011/10/extact-host)
(1657711982,facebook.de/2014/11/name.jpg)
(1797439709,facebook.de/2014/11/photos)
(-1578471469,community.cloudera.com/t10/)
(996122257,google.de/2011/10/Extract-host)
####
(-1578471469,(1.2982456140350878,community.cloudera.com/t10/))
(1797439709,(0.7017543859649124,facebook.de/2014/11/photos))
(1547760250,(0.7017543859649124,community.cloudera.com/t5/))
(996122257,(0.7017543859649124,google.de/2011/10/Extract-host))
(1657711982,(1.0000000000000002,facebook.de/2014/11/name.jpg))
(-1127017098,(1.5964912280701753,link.de/2011/10/extact-host))
You could map the hashed IDs out of the RDD, but I believe PageRank is not your final goal, so you will probably need them later.
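If you do want hash-free output, a minimal sketch of that mapping step (assuming the `fullMap` RDD from the code above, whose elements are `(hash, (rank, url))`) could look like this:

```scala
// Drop the hashed vertex IDs, keeping only (url, rank).
val readableRanks: RDD[(String, Double)] =
  fullMap.map { case (_, (rank, url)) => (url, rank) }

readableRanks.foreach(println)
```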