Turning a Spark dataframe of edges into a GraphX graph

Question

I have a dataframe like this:

> | Id1 | Id2 | attr1 | attr2 | attr3 |
> |----:|----:|------:|------:|------:|
> | 1   | 2   | 1     | 0     | .5    |
> | 1   | 3   | 1     | 1     | .33   |
> | 2   | 3   | 0     | .6    | .7    |

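I would like to create one edge for each non-zero attribute, using the value in the table as the edge weight. How do I go about doing this? I can't seem to find any simple way, so for now I'm just using a for loop and iterating over each row, but that seems inefficient. Thanks!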

Answer 1

The three attribute columns can be stacked into a single weight column. After filtering that column for nonzero values, a GraphFrame is constructed that contains no edges with weight zero:

from graphframes import GraphFrame
from pyspark.sql import functions as F

df = ...

# stack() emits one row per attribute, so each input row yields three candidate edges
edges = df.withColumn("weight", F.expr("stack(3,cast(attr1 as double),cast(attr2 as double),cast(attr3 as double))"))\
      .drop("attr1","attr2","attr3") \
      .filter("weight <> 0.0") \
      .withColumnRenamed("Id1", "src") \
      .withColumnRenamed("Id2", "dst")

# GraphFrame expects a vertex dataframe with an "id" column
vertices = edges.selectExpr("src as id").union(edges.selectExpr("dst as id")).distinct()

g = GraphFrame(vertices, edges)

As a test, the in-degree of each vertex can be checked:

g.inDegrees.show()

prints

+---+--------+
| id|inDegree|
+---+--------+
|  3|       5|
|  2|       2|
+---+--------+

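This result is consistent with the given data: vertex 2 has two incoming edges from the first row of the sample data, and vertex 3 has three incoming edges from the second row and two from the third row.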
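
The same counts can be cross-checked without a Spark session. The following is only a minimal sketch in plain Python that mimics the stack-and-filter logic on the three sample rows from the question; the names `rows`, `edges` and `in_degrees` are illustrative and not part of the answer's code.

```python
from collections import Counter

# Sample rows copied from the question: (Id1, Id2, attr1, attr2, attr3)
rows = [
    (1, 2, 1.0, 0.0, 0.5),
    (1, 3, 1.0, 1.0, 0.33),
    (2, 3, 0.0, 0.6, 0.7),
]

# Mimic stack(3, attr1, attr2, attr3): one candidate edge per attribute value,
# keeping only the nonzero weights, as the filter step does.
edges = [
    (src, dst, weight)
    for src, dst, *attrs in rows
    for weight in attrs
    if weight != 0.0
]

# Count incoming edges per destination vertex.
in_degrees = Counter(dst for _, dst, _ in edges)
print(dict(in_degrees))  # {2: 2, 3: 5}
```

Seven edges survive the filter (two, three and two from the respective rows), which matches the in-degrees reported by the GraphFrame.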

python

apache-spark

spark-graphx

pyspark