Storing Multiple Columns of Data in Edges and Vertices in Spark

Question

I am new to Spark GraphX. My edges DataFrame is:

DataFrame: edges_main
+------------------+------------------+------------+--------+-----------+
|               src|               dst|relationship|category|subcategory|
+------------------+------------------+------------+--------+-----------+
|294201130817328347|294201131015844283|      friend|  school|      class|
|294201131015844283|294201131007361339|     brother|    home|     cousin|
|294201131015844283|294201131014451003|         son|    home|   relative|
+------------------+------------------+------------+--------+-----------+

and my vertices DataFrame is:

DataFrame: vertices_main
+------------------+-----+----+
|                id|value|name|
+------------------+-----+----+
|294201130817328347| Mary|   a|
|294201131015844283| Hola|   b|
|294201131015844283| Rama|   c|
+------------------+-----+----+

I want to keep the additional attributes in GraphX so that I can access them with map. My code:

case class MyEdges(src: String, dst: String, attributes: MyEdgesLabel)
case class MyEdgesLabel(relationship:String,category: String ,subcategory:String)

val edges = edges_main.as[MyEdges].rdd.map { edge =>
      Edge(
        edge.src.toLong,
        edge.dst.toLong,
        //**what to mention here(MyEdgesLabel)**//
      )}

case class MyVerticesLabel(name:String)

val vertices: RDD[(VertexId, Any)] = vertices_data.rdd.map(verticesRow => (
      verticesRow.getLong(0),
      verticesRow.getString(1))
//**what to mention here(MyVerticesLabel)**//
    )

The reason for this requirement is that, once the graph is created, I want to access the additional attributes directly, like this:

val g = Graph(vertices, edges)
g.vertices.map(v => v._1 + v._2 + /* additional attributes from case class MyVerticesLabel */).collect.mkString
g.edges.map(e => e.srcId + e.dstId + e.attr(/* additional attributes from case class MyEdgesLabel */)).collect.mkString

I found some hints at the URL below, but I am still confused about how to supply multiple attributes for vertices and edges: http://www.sunlab.org/teaching/cse6250/fall2019/spark/spark-graphx.html#graph-construction

Please help me solve this.

Answer 1

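You can use one case class for the edge attributes and another for the vertex attributes. MyEdgesLabel can already be used for the edges. Note that edge.relationship and the other fields only resolve if MyEdges exposes the DataFrame columns as flat fields (not as a nested attributes field), which is also what .as[MyEdges] needs in order to map the columns. To create the edge RDD, simply do:

// MyEdges mirrors the flat columns of edges_main
case class MyEdges(src: String, dst: String, relationship: String, category: String, subcategory: String)

val edges = edges_main.as[MyEdges].rdd.map { edge =>
      Edge(
        edge.src.toLong,
        edge.dst.toLong,
        MyEdgesLabel(edge.relationship, edge.category, edge.subcategory)
      )}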

For the vertices, the case class needs to include both value and name:

case class MyVerticesLabel(value: String, name: String)

Then use it to create the vertex RDD:

val vertices: RDD[(VertexId, MyVerticesLabel)] = vertices_data.rdd.map{verticesRow => 
    (verticesRow.getAs[Long]("id"),
    MyVerticesLabel(verticesRow.getAs[String]("value"), verticesRow.getAs[String]("name")))
}

Now the values can be accessed easily, for example:

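// String interpolation avoids numerically adding the two Long ids before concatenation
g.edges.map(e => s"${e.srcId} ${e.dstId} ${e.attr.relationship}").collect.mkString("\n")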
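
To tie the pieces together, here is a minimal end-to-end sketch of the same idea. It is an illustration under assumptions, not the asker's actual pipeline: the RDDs are built by hand with sc.parallelize instead of from edges_main and vertices_main, Spark runs in local mode, and the names MultiAttributeGraph and buildGraph are made up for this example. It also reads g.triplets, which gives access to the edge attribute together with both endpoint vertex attributes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Attribute types from the answer: one case class per element kind.
case class MyEdgesLabel(relationship: String, category: String, subcategory: String)
case class MyVerticesLabel(value: String, name: String)

// Hypothetical wrapper object, named only for this sketch.
object MultiAttributeGraph {

  // Builds a tiny graph from hand-written rows (stand-ins for the RDDs
  // derived from vertices_main and edges_main).
  def buildGraph(sc: SparkContext): Graph[MyVerticesLabel, MyEdgesLabel] = {
    val vertices: RDD[(VertexId, MyVerticesLabel)] = sc.parallelize(Seq(
      (294201130817328347L, MyVerticesLabel("Mary", "a")),
      (294201131015844283L, MyVerticesLabel("Hola", "b"))
    ))
    val edges: RDD[Edge[MyEdgesLabel]] = sc.parallelize(Seq(
      Edge(294201130817328347L, 294201131015844283L,
        MyEdgesLabel("friend", "school", "class"))
    ))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, purely for illustration.
    val conf = new SparkConf().setAppName("MultiAttributeGraph").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val g = buildGraph(sc)

      // Each vertex is a pair (id, MyVerticesLabel).
      g.vertices
        .map { case (id, attr) => s"$id ${attr.value} ${attr.name}" }
        .collect()
        .foreach(println)

      // e.attr is a MyEdgesLabel, so every field is available by name.
      g.edges
        .map(e => s"${e.srcId} ${e.dstId} ${e.attr.relationship} ${e.attr.category} ${e.attr.subcategory}")
        .collect()
        .foreach(println)

      // A triplet carries the edge attribute plus both vertex attributes.
      g.triplets
        .map(t => s"${t.srcAttr.value} -[${t.attr.relationship}]-> ${t.dstAttr.value}")
        .collect()
        .foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The case classes are declared at the top level rather than inside main so that Spark can serialize them inside the closures without dragging in an enclosing scope.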

scala

graph

apache-spark

spark-graphx