Spark 的过滤操作是如何作用于 GraphX 边的？

Question

我是 Spark 的新手，并不太了解基础知识，我只是跳入其中来解决问题。该问题的解决方案涉及制作一个图（使用 GraphX），其中边具有字符串属性。用户可能希望查询此图，我通过仅过滤掉那些具有等于用户查询的字符串属性的边来处理查询。

现在，我的图有超过 1600 万条边；当我使用计算机的所有 8 个内核时，创建图表需要 10 多分钟。然而，当我查询这个图表时（就像我上面提到的那样），我会立即得到结果（令我惊喜的是）。

所以，我的问题是，过滤操作究竟是如何搜索我查询的边的？它会反复查看它们吗？是否在多个内核上搜索边缘并且看起来非常快？或者是否涉及某种散列？

这是我如何使用过滤器的示例：Mygraph.edges.filter(_.attr(0).equals("cat")) 这意味着我想检索边缘其中包含 "cat" 属性。如何搜索边缘？

Answer 1

如何过滤结果是即时的？

运行你的语句 returns 这么快是因为它实际上并没有执行过滤。 Spark 使用惰性评估：在您执行实际收集结果的操作之前，它不会实际执行转换。调用一个转换方法，比如 filter 只是创建一个新的 RDD 来表示这个转换及其结果。您将必须执行操作，例如 collect 或 count 才能实际执行它：

def myGraph: Graph = ???

// No filtering actually happens yet here, the results aren't needed yet so Spark is lazy and doesn't do anything
val filteredEdges = myGraph.edges.filter()

// Counting how many edges are left requires the results to actually be instantiated, so this fires off the actual filtering
println(filteredEdges.count)

// Actually gathering all results also requires the filtering to be done
val collectedFilteredEdges = filteredEdges.collect

请注意，在这些示例中，过滤结果并未存储在两者之间：由于懒惰，过滤会针对两个操作重复进行。为防止重复，您应该在阅读有关转换和操作的详细信息以及 Spark 在幕后实际执行的操作后，研究 Spark 的缓存功能：https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

过滤操作究竟如何搜索我查询的边（当我执行一个动作时）？

在 Spark GraphX 中，边存储在 EdgeRDD[ED] 类型的 RDD 中，其中 ED 是边属性的类型，在您的例子中是 String。这个特殊的 RDD 在后台做了一些特殊的优化，但是为了你的目的，它的行为就像它的超类 RDD[Edge[ED]] 并且过滤就像过滤任何 RDD 一样发生：它将遍历所有项目，将给定的谓词应用于每个项目。然而，RDD 被拆分为多个分区，Spark 将并行过滤多个分区；在您似乎运行本地 Spark 的情况下，它会并行执行与您拥有的核心数量一样多的任务，或者您使用 --master local[4] 明确指定的数量。

带边的 RDD 基于 PartitionStrategy that is set, for instance if you create your graph with Graph.fromEdgeTuples or by calling partitionBy on your graph. All strategies are based on the edge's vertices however, so don't have any knowledge about your attribute, and so don't affect your filtering operation, except maybe for some unbalanced network load if you'd run it on a cluster, all 'cat' edges end up in the same partition/executor and you do a collect or some shuffle operation. See the GraphX docs on Vertex and Edge RDDs 进行分区，以获取有关如何表示和分区图的更多信息。

Spark 的过滤操作是如何作用于 GraphX 边的？

How does the filter operation of Spark work on GraphX edges?

apache-spark

spark-graphx

如何过滤结果是即时的？

过滤操作究竟如何搜索我查询的边（当我执行一个动作时）？