Does GraphFrames api support creation of Bipartite graphs?
Does the GraphFrames API support creating bipartite graphs in the current version?
Current version: 0.1.0
Spark version: 1.6.1
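As pointed out in the comments on the question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create them. For a GraphX solution, see the previous answer. That solution uses a trait shared between the different vertex/object types. While that works with RDDs, it does not work with DataFrames. A row in a DataFrame has a fixed schema: it cannot contain a price column in some rows and lack it in others. It can have a price column whose value is sometimes null, but the column must exist in every row.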
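To make the fixed-schema point concrete, here is a minimal sketch for spark-shell (where the SQL implicits are already imported). The items data and its column names are hypothetical, chosen only to mirror the price example above.

```scala
// Hypothetical data: every row carries a price column, but its value may be missing.
val items = Seq(
  (1, Some(9.99)),
  (2, Option.empty[Double]) // no price for this row: stored as null
).toDF("id", "price")

// The column is part of the schema for every row; only the value is nullable.
items.printSchema()

// Rows without a price are found by filtering on null, not by a missing column.
val unpriced = items.filter($"price".isNull).count()
```

The Option[Double] field is what makes Spark mark the column as nullable; a plain Double field would be declared non-nullable.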
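Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that is essentially a linear subtype of both types of objects in the bipartite graph: it must contain all the fields of both types. This is actually quite easy, because a join with full_outer gives you exactly that. Something like this: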
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
).toDF("id","team","record")
Then you can create a superset DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+
You can make it a bit cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+
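The answer stops at the vertex DataFrame, so the following is a rough sketch of how it might be turned into an actual GraphFrame. It assumes spark-shell was started with the GraphFrames package on the classpath; the edges DataFrame and its "plays_for" relationship are hypothetical and not part of the original answer. The data is rebuilt here so the snippet stands alone.

```scala
import org.apache.spark.sql.functions.{coalesce, struct}
import org.graphframes.GraphFrame

val players = Seq((1, "dave", 34), (2, "griffin", 44)).toDF("id", "name", "age")
val teams = Seq((101, "lions", "7-1"), (102, "tigers", "5-3"), (103, "bears", "0-9"))
  .toDF("id", "team", "record")

// Superset vertex DataFrame: one struct column per side of the bipartite graph.
val vertices = players.select($"id" as "l_id", struct($"name", $"age") as "player")
  .join(
    teams.select($"id" as "r_id", struct($"team", $"record") as "team"),
    $"l_id" === $"r_id",
    "full_outer")
  .select(coalesce($"l_id", $"r_id") as "id", $"player", $"team")

// Hypothetical edges: GraphFrames expects "src" and "dst" columns.
val edges = Seq((1, 101, "plays_for"), (2, 103, "plays_for"))
  .toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// Bipartite sanity check: every edge should run from a player vertex to a team vertex.
val violations = g.find("(a)-[e]->(b)")
  .filter($"a.player".isNull || $"b.team".isNull)
  .count()
```

Nothing in GraphFrames enforces the bipartite property, so a check like the one on violations has to be run explicitly whenever the edges change.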
I'll also point out that more or less the same solution works in GraphX with RDDs. You can always create a vertex by joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
(1L, Player("dave", 34)),
(2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
(101L, Team("lions", "7-1")),
(102L, Team("tigers", "5-3")),
(103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(dave,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
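As a sketch of the remaining step, the joined RDD can serve directly as the vertex RDD of a GraphX Graph. This assumes spark-shell (so sc is available); the edges and their "plays_for" attribute are hypothetical additions, and the case classes are repeated so the snippet stands alone.

```scala
import org.apache.spark.graphx.{Edge, Graph}

case class Player(name: String, age: Int)
case class Team(team: String, record: String)

val playerRdd = sc.parallelize(Seq(
  (1L, Player("dave", 34)),
  (2L, Player("griffin", 44))
))
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))

// Vertex attribute type is (Option[Player], Option[Team]); exactly one side is defined.
val vertices = playerRdd.fullOuterJoin(teamRdd)

// Hypothetical edges from players to teams.
val edges = sc.parallelize(Seq(
  Edge(1L, 101L, "plays_for"),
  Edge(2L, 103L, "plays_for")
))

val graph = Graph(vertices, edges)

// Bipartite sanity check: the source must be a player and the destination a team.
val violations = graph.triplets
  .filter(t => t.srcAttr._1.isEmpty || t.dstAttr._2.isEmpty)
  .count()
```

Every edge here points at an existing vertex. If an edge referenced an unknown id, GraphX would fill in its default vertex attribute (null unless one is supplied), so the check would need to handle that case.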
Compared with the previous answer, this seems to be a more flexible way to handle it, since the combined objects do not need to share a trait.