GraphX Vertices populated with Null values
I am trying out some code in Spark GraphX, but I am running into trouble with null values.
scala> verticesRDD
res76: org.apache.spark.rdd.RDD[(Long, (String, Long))] = MapPartitionsRDD[78] at map at <console>:51
scala> EdgesRDD
res77: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Boolean]] = MapPartitionsRDD[18] at map at <console>:41
val graph = Graph(verticesRDD, EdgesRDD).cache()
scala> graph
res75: org.apache.spark.graphx.Graph[(String, Long),Boolean] = org.apache.spark.graphx.impl.GraphImpl@9533103
If I extract the vertex attributes, I get some null values.
val x = graph.vertices.map{case(id, v) => v}
scala> x
res78: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[149] at map at <console>:56
scala> x.filter(_ == null).count()
res79: Long = 8999
There are no nulls in the source verticesRDD.
val x = verticesRDD.map{case(id,v) => v}
scala> x
res80: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[151] at map at <console>:54
scala> x.filter(_ == null).count()
res81: Long = 0
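I cannot understand how vertex values can be null in the graph when the source vertex RDD contains no nulls.
Any insight into this would be much appreciated.
Thanks.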
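When the VertexIds in verticesRDD and EdgesRDD do not match, that is, when an edge refers to a vertexId that is absent from verticesRDD, a vertex is created for the unmatched vertexId with the default vertex attribute, which is null unless you pass one to Graph(...). That is why you have nulls in the Graph even though there are none in verticesRDD.
A simple example makes this more obvious: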
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> val verticesRDD: RDD[(Long, (String, Long))] = sc.parallelize(Seq((0L, ("Subhasis", 0L))))
verticesRDD: org.apache.spark.rdd.RDD[(Long, (String, Long))] = ParallelCollectionRDD[0] at parallelize at <console>:28
scala> val EdgesRDD: RDD[Edge[Boolean]] = sc.parallelize(Seq(Edge(1L, 0L, true)))
EdgesRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Boolean]] = ParallelCollectionRDD[1] at parallelize at <console>:28
scala> val graph = Graph(verticesRDD, EdgesRDD)
graph: org.apache.spark.graphx.Graph[(String, Long),Boolean] = org.apache.spark.graphx.impl.GraphImpl@5563a63f
scala> graph.vertices.foreach(println)
(1,null)
(0,(Subhasis,0))
18/04/20 08:26:22 WARN Executor: 1 block locks were not released by TID = 9:
[rdd_9_3]
18/04/20 08:26:22 WARN Executor: 1 block locks were not released by TID = 8:
[rdd_9_2]
18/04/20 08:26:22 WARN Executor: 1 block locks were not released by TID = 10:
[rdd_9_0]
18/04/20 08:26:22 WARN Executor: 1 block locks were not released by TID = 11:
[rdd_9_1]
scala>
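You can clearly see that (1,null) was created in the graph for the vertexId in EdgesRDD that has no match in verticesRDD.
I hope the explanation is clear and helpful.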
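As a follow-up to the example above, here is a minimal sketch of two things that may help: listing the vertexIds that edges refer to but verticesRDD lacks, and passing a default vertex attribute to Graph(...) so that those vertices get a placeholder instead of null. It assumes a spark-shell session where sc is available, as in the transcript. The names missingIds, placeholder, graphWithDefault and nullCount, and the placeholder value itself, are made up for illustration.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Same toy data as in the example above: vertex 1 appears only in an edge.
val verticesRDD: RDD[(VertexId, (String, Long))] =
  sc.parallelize(Seq((0L, ("Subhasis", 0L))))
val EdgesRDD: RDD[Edge[Boolean]] =
  sc.parallelize(Seq(Edge(1L, 0L, true)))

// 1. Vertex IDs used as an edge endpoint but absent from verticesRDD.
val missingIds: RDD[VertexId] = EdgesRDD
  .flatMap(e => Seq(e.srcId, e.dstId))
  .distinct()
  .subtract(verticesRDD.keys)

// 2. Third argument of Graph(...) is the default vertex attribute.
//    The value below is a hypothetical placeholder, not a GraphX convention.
val placeholder: (String, Long) = ("<missing>", -1L)
val graphWithDefault = Graph(verticesRDD, EdgesRDD, placeholder)

// Count vertices whose attribute is still null after supplying the default.
val nullCount = graphWithDefault.vertices
  .filter { case (_, attr) => attr == null }
  .count()

val missing = missingIds.collect().toSeq
println(s"missing vertex ids: $missing, null attributes: $nullCount")
```

On the real data from the question, missingIds.count() should presumably line up with the 8999 nulls, since each null vertex corresponds to one unmatched vertexId. Whether a placeholder is appropriate, or whether the edges pointing at unknown vertices should be dropped instead, depends on what the data is supposed to mean.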