How to build a graph from a dataframe? (GraphX)
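I am new to Scala and Spark, and I need to build a graph from a dataframe. This is the structure of my dataframe, where S and O are the nodes and column P represents the edges.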
+---------------------------+---------------------+----------------------------+
|S |P |O |
+---------------------------+---------------------+----------------------------+
|http://website/Jimmy_Carter|http://web/name |James Earl Carter |
|http://website/Jimmy_Car |http://web/country |http://website/United_States|
|http://website/Jimmy_Car |http://web/birthPlace|http://web/Georgia_(US) |
+---------------------------+---------------------+----------------------------+
This is the code that builds the dataframe. I want to create a graph from the dataframe "dfA".

import scala.util.Try
import org.apache.spark.sql.functions.regexp_extract

// Triple is presumably a case class with Option[String] fields Subject, Predicate and Object.
val test = sc
  .textFile("testfile.ttl")
  .map(_.split(" "))
  .map(p => Triple(Try(p(0).toString()).toOption,
                   Try(p(1).toString()).toOption,
                   Try(p(2).toString()).toOption))
  .toDF()

val url_regex = """^(?:"|<{1}\s?)(.*)(?:>(?:\s\.)?|,\s.*)$"""

val dfA = test
  .withColumn("Subject", regexp_extract($"Subject", url_regex, 1))
  .withColumn("Predicate", regexp_extract($"Predicate", url_regex, 1))
  .withColumn("Object", regexp_extract($"Object", url_regex, 1))
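To create a GraphX graph, you need to extract the vertices from your dataframe and associate them with IDs. Then, you need to extract the edges (2-tuples of vertices + metadata) using these IDs. All of that needs to be in RDDs, not dataframes.

In other words, you need an RDD[(VertexId, X)] for the vertices and an RDD[Edge[Y]] for the edges, where X is the vertex metadata and Y is the edge metadata. Each Edge holds a source VertexId, a destination VertexId and the metadata. Note that VertexId is just an alias for Long.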
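As a rough illustration of those two shapes, here is a minimal hand-built graph. The local SparkSession, the app name and the hard-coded IDs 0 and 1 are assumptions made for this sketch only; in spark-shell, `spark` and `sc` already exist.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("graphx-shapes")
  .getOrCreate()
val sc = spark.sparkContext

// Vertices are (id, metadata) pairs. Here the metadata X is a String.
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (0L, "http://website/Jimmy_Carter"),
  (1L, "James Earl Carter")
))

// Edges are Edge(srcId, dstId, metadata). Here the metadata Y is a String.
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(0L, 1L, "http://web/name")
))

// Graph[VD, ED] is parameterized by the vertex and edge metadata types.
val graph: Graph[String, String] = Graph(vertices, edges)
```

The rest of the work is producing these two RDDs from the dataframe instead of writing them by hand.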
In your case, with "S" and "O" as the vertex columns and "P" as the edge column, it would look as follows.
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.{array, explode}

// Let's create the vertex RDD.
val vertices: RDD[(VertexId, String)] = df
  .select(explode(array('S, 'O))) // S and O are the vertices
  .distinct // we remove duplicates
  .rdd.map(_.getAs[String](0)) // transform to RDD
  .zipWithIndex // associate a long index to each vertex
  .map(_.swap)

// Now let's define a vertex dataframe because joins are clearer in sparkSQL
val vertexDf = vertices.toDF("id", "node")

// And let's extract the edges and join their vertices with their respective IDs
val edges: RDD[Edge[String]] = df
  .join(vertexDf, df("S") === vertexDf("node")) // getting the IDs for "S"
  .select('P, 'O, 'id as 'idS)
  .join(vertexDf, df("O") === vertexDf("node")) // getting the IDs for "O"
  .rdd.map(row => // creating the edge using column "P" as metadata
    Edge(row.getAs[Long]("idS"), row.getAs[Long]("id"), row.getAs[String]("P")))

// And finally
val graph = Graph(vertices, edges)
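To check the result, the same steps can be run end to end on the three sample rows from the question. This is only a sketch: the local SparkSession, the app name, the inline `Seq(...).toDF("S", "P", "O")` stand-in for the real dataframe, the `idS`/`idO` column names and the `cache()` call are assumptions of mine, and the joins are written as column-name joins instead of the expression joins used in the answer.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode}

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-to-graphx")
  .getOrCreate()
import spark.implicits._

// Stand-in for the real dataframe: the three rows shown in the question.
val df = Seq(
  ("http://website/Jimmy_Carter", "http://web/name", "James Earl Carter"),
  ("http://website/Jimmy_Car", "http://web/country", "http://website/United_States"),
  ("http://website/Jimmy_Car", "http://web/birthPlace", "http://web/Georgia_(US)")
).toDF("S", "P", "O")

// Distinct values of S and O, each paired with a Long index.
// Cached so that both uses below see the same index assignment.
val vertices: RDD[(VertexId, String)] = df
  .select(explode(array(col("S"), col("O"))))
  .distinct
  .rdd.map(_.getAs[String](0))
  .zipWithIndex
  .map(_.swap)
  .cache()

val vertexDf = vertices.toDF("id", "node")

// Look up the ID of S, then the ID of O, by joining on the column name.
val edges: RDD[Edge[String]] = df
  .join(vertexDf.select(col("id").as("idS"), col("node").as("S")), "S")
  .join(vertexDf.select(col("id").as("idO"), col("node").as("O")), "O")
  .rdd.map(row =>
    Edge(row.getAs[Long]("idS"), row.getAs[Long]("idO"), row.getAs[String]("P")))

val graph = Graph(vertices, edges)

// Print each edge together with the metadata of its two endpoints.
graph.triplets
  .map(t => s"${t.srcAttr} --[${t.attr}]--> ${t.dstAttr}")
  .collect()
  .foreach(println)
```

The `cache()` call is there because `vertices` is used twice, and the `zipWithIndex` documentation warns that indices can change if an RDD without a guaranteed element order is re-evaluated. Whether this matters in practice depends on the data and the cluster, so treat it as a precaution rather than a requirement.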