How to join data frames in my example?
I have two data frames:
edges =
srcId dstId timestamp
1 3 1345534569
1 4 1346564657
1 2 1345769687
2 3 1345769687
4 3 1345769687
vertices =
id name s_type
1 abc A
2 def B
3 rtf C
4 wrr D
I want to get a data frame with the following structure (an example for the first row):
result =
srcId name_src s_type_src dstId name_dst s_type_dst timestamp
1 abc A 3 rtf C 1345534569
In other words, I want to add the suffix _src to the columns joined on srcId, and the suffix _dst to the columns joined on dstId.
Here is my approach to the task, but I don't know how to assign the _src and _dst suffixes to the column names:
val result = edges
  .join(vertices, col("srcId") === col("id"), "inner")
  .join(vertices, col("dstId") === col("id"), "inner")
You can simply select the wanted columns and rename them via as():
// assumes spark-shell / a notebook where spark.implicits._ is already in
// scope (needed for toDF on a Seq and for the $"..." column syntax)
import org.apache.spark.sql.functions._

val edges = Seq(
  (1, 3, 1345534569),
  (1, 4, 1346564657),
  (1, 2, 1345769687),
  (2, 3, 1345769687),
  (4, 3, 1345769687)
).toDF("srcId", "dstId", "timestamp")

val vertices = Seq(
  (1, "abc", "A"),
  (2, "def", "B"),
  (3, "rtf", "C"),
  (4, "wrr", "D")
).toDF("id", "name", "s_type")

val result = edges.
  join(vertices.as("s"), $"srcId" === $"s.id", "inner").
  join(vertices.as("d"), $"dstId" === $"d.id", "inner").
  select(
    $"srcId", $"s.name".as("name_src"), $"s.s_type".as("s_type_src"),
    $"dstId", $"d.name".as("name_dst"), $"d.s_type".as("s_type_dst"),
    $"timestamp"
  )
result.show
// +-----+--------+----------+-----+--------+----------+----------+
// |srcId|name_src|s_type_src|dstId|name_dst|s_type_dst| timestamp|
// +-----+--------+----------+-----+--------+----------+----------+
// | 1| abc| A| 3| rtf| C|1345534569|
// | 1| abc| A| 4| wrr| D|1346564657|
// | 1| abc| A| 2| def| B|1345769687|
// | 2| def| B| 3| rtf| C|1345769687|
// | 4| wrr| D| 3| rtf| C|1345769687|
// +-----+--------+----------+-----+--------+----------+----------+
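As a side note on why the as("s") / as("d") aliases are needed: joining the same vertices DataFrame twice leaves two copies of id, name and s_type in the plan, so an unqualified col("id"), as in the attempt from the question, can no longer be resolved. A minimal sketch of the failure, reusing the edges and vertices defined above:

// After the first join the plan already contains an "id" column, and the
// second vertices copy brings in another one, so Spark typically rejects
// the unqualified reference with an "ambiguous" AnalysisException.
val firstJoin = edges.join(vertices, col("srcId") === col("id"), "inner")
// firstJoin.join(vertices, col("dstId") === col("id"), "inner")
// => org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous ...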
Alternatively, you could rename the vertices columns accordingly before the joins, like in the following:
val cols = vertices.columns
val v_src = vertices.toDF(cols.map(_ + "_src"): _*)
val v_dst = vertices.toDF(cols.map(_ + "_dst"): _*)

val result = edges.
  join(v_src, $"srcId" === $"id_src", "inner").
  join(v_dst, $"dstId" === $"id_dst", "inner")
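Note that this variant keeps the id_src and id_dst join keys in the result, so to match the exact layout asked for in the question you would still select the wanted columns afterwards. A minimal sketch, assuming the result DataFrame built just above (the name resultShaped is only illustrative):

// keep only the requested columns (which drops id_src and id_dst)
// in the desired order
val resultShaped = result.select(
  $"srcId", $"name_src", $"s_type_src",
  $"dstId", $"name_dst", $"s_type_dst",
  $"timestamp"
)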