Storing multiple columns of data in edges and vertices in Spark
I am new to Spark GraphX. My edge DataFrame is:
DataFrame: edges_main
+------------------+------------------+------------+--------+-----------+
|               src|               dst|relationship|category|subcategory|
+------------------+------------------+------------+--------+-----------+
|294201130817328347|294201131015844283|      friend|  school|      class|
|294201131015844283|294201131007361339|     brother|    home|     cousin|
|294201131015844283|294201131014451003|         son|    home|   relative|
+------------------+------------------+------------+--------+-----------+
and my vertex DataFrame is:
DataFrame: vertices_main
+------------------+-----+----+
|                id|value|name|
+------------------+-----+----+
|294201130817328347| Mary|   a|
|294201131015844283| Hola|   b|
|294201131015844283| Rama|   c|
+------------------+-----+----+
I want to keep the additional attributes in GraphX so that I can access them with map. My code:
case class MyEdges(src: String, dst: String, attributes: MyEdgesLabel)
case class MyEdgesLabel(relationship:String,category: String ,subcategory:String)
val edges = edges_main.as[MyEdges].rdd.map { edge =>
Edge(
edge.src.toLong,
edge.dst.toLong,
//**what to mention here(MyEdgesLabel)**//
)}
case class MyVerticesLabel(name:String)
val vertices: RDD[(VertexId, Any)] = vertices_data.rdd.map(verticesRow => (
verticesRow.getLong(0),
verticesRow.getString(1))
//**what to mention here(MyVerticesLabel)**//
)
The reason for this requirement is that, once the graph is created, I can access the additional attributes directly like this:
val g = Graph(vertices, edges)
g.vertices.map(v => v._1 + v._2 + /* additional attributes from case class MyVerticesLabel */).collect.mkString
g.edges.map(e => e.srcId + e.dstId + e.attr(/* additional attributes from case class MyEdgesLabel */)).collect.mkString
I found some hints at the URL below, but I am still confused about how to supply multiple attributes for vertices and edges:
http://www.sunlab.org/teaching/cse6250/fall2019/spark/spark-graphx.html#graph-construction.
Please help me solve this.
You can use one case class as the edge attribute and another as the vertex attribute. MyEdgesLabel can already be used for the edges. To create the edge RDD, simply do:
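// MyEdges must mirror the flat columns of edges_main for .as[MyEdges] to resolve,
// so the nested `attributes` field from the question is replaced by three plain fields.
// .as[...] also needs `import spark.implicits._` in scope.
case class MyEdges(src: String, dst: String, relationship: String, category: String, subcategory: String)
val edges: RDD[Edge[MyEdgesLabel]] = edges_main.as[MyEdges].rdd.map { edge =>
  Edge(
    edge.src.toLong,
    edge.dst.toLong,
    // all three extra columns travel together as the edge attribute
    MyEdgesLabel(edge.relationship, edge.category, edge.subcategory)
  )}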
For the vertices, the case class needs to include both value and name:
case class MyVerticesLabel(value: String, name: String)
Then use it to create the vertex RDD:
val vertices: RDD[(VertexId, MyVerticesLabel)] = vertices_data.rdd.map{verticesRow =>
(verticesRow.getAs[Long]("id"),
MyVerticesLabel(verticesRow.getAs[String]("value"), verticesRow.getAs[String]("name")))
}
Now the values can be accessed easily, for example:
g.edges.map(e => e.srcId + e.dstId + e.attr.relationship).collect.mkString
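To tie the pieces together, here is a minimal end-to-end sketch of the same idea. It is an illustration rather than the exact code from the question: the object name MultiAttributeGraph, the helper build, the local SparkSession, and the "unknown" default attribute are all assumptions, and the data is a hand-typed subset of the tables above instead of the real DataFrames. It also passes a default vertex attribute to Graph, because some dst ids in the edge table have no row in the vertex table, and GraphX assigns the default attribute to such vertices.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Attribute types carried by the graph (same shape as in the answer above).
case class MyEdgesLabel(relationship: String, category: String, subcategory: String)
case class MyVerticesLabel(value: String, name: String)

// Hypothetical wrapper object, used here only to make the sketch self-contained.
object MultiAttributeGraph {

  // Builds a small graph from in-memory sample rows (a subset of the tables above).
  def build(sc: SparkContext): Graph[MyVerticesLabel, MyEdgesLabel] = {
    val vertices: RDD[(VertexId, MyVerticesLabel)] = sc.parallelize(Seq(
      (294201130817328347L, MyVerticesLabel("Mary", "a")),
      (294201131015844283L, MyVerticesLabel("Hola", "b"))
    ))
    val edges: RDD[Edge[MyEdgesLabel]] = sc.parallelize(Seq(
      Edge(294201130817328347L, 294201131015844283L, MyEdgesLabel("friend", "school", "class")),
      Edge(294201131015844283L, 294201131007361339L, MyEdgesLabel("brother", "home", "cousin"))
    ))
    // Assumed placeholder for vertices that appear only in the edge list.
    val defaultVertex = MyVerticesLabel("unknown", "unknown")
    Graph(vertices, edges, defaultVertex)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("multi-attr-graph").getOrCreate()
    val g = build(spark.sparkContext)

    // Vertex attributes are reached through the second element of each pair.
    g.vertices.map { case (id, attr) => s"$id ${attr.value} ${attr.name}" }
      .collect().foreach(println)

    // Edge attributes are reached through e.attr.
    g.edges.map(e => s"${e.srcId} -> ${e.dstId} ${e.attr.relationship}/${e.attr.category}/${e.attr.subcategory}")
      .collect().foreach(println)

    // Triplets expose the edge attribute together with both endpoint attributes.
    g.triplets.map(t => s"${t.srcAttr.value} -[${t.attr.relationship}]-> ${t.dstAttr.value}")
      .collect().foreach(println)

    spark.stop()
  }
}
```

String interpolation is used instead of `e.srcId + e.dstId + ...`, since adding two Long ids sums them numerically before the string is appended. If the vertex input contains the same id more than once, as in the sample table, GraphX keeps an arbitrary one of the duplicates, so it may be worth deduplicating before building the graph.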