在 Scala 中将来自不同数据帧的行合并在一起
Merging rows from different dataframes together in Scala
例如,首先我有一个这样的数据框
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
+----+-----+-----+--------------------+-----+
我们有 2012 年、1997 年和 2015 年。我们还有另一个像这样的 Dataframe
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|BMW | 3| No comment| |
|1997|VW | GTI | get | |
|2015|MB | C200| good| null|
+----+-----+-----+--------------------+-----+
我们还有 2012 年、1997 年、2015 年。我们如何将同一年的行合并在一起?谢谢
输出应该是这样的
+----+-----+-----+--------------------+-----++-----+-----+--------------------------+
|year| make|model| comment|blank|| make|model| comment|blank|
+----+-----+-----+--------------------+-----++-----+-----+-----+--------------------+
|2012|Tesla| S| No comment| |BMW | 3 | no comment|
|1997| Ford| E350|Go get one now th...| |VW |GTI | get |
|2015|Chevy| Volt| null| null|MB |C200 | Good |null
+----+-----+-----+--------------------+-----++----+-----+-----+---------------------+
您可以通过简单的 join
获得您想要的 table。类似于:
val joined = df1.join(df2, df1("year") === df2("year"))
我加载了您的输入,因此我看到了以下内容:
scala> df1.show
...
year make model comment
2012 Tesla S No comment
1997 Ford E350 Go get one now
2015 Chevy Volt null
scala> df2.show
...
year make model comment
2012 BMW 3 No comment
1997 VW GTI get
2015 MB C200 good
当我 运行 join
时,我得到:
scala> val joined = df1.join(df2, df1("year") === df2("year"))
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string]
scala> joined.show
...
year make model comment year make model comment
2012 Tesla S No comment 2012 BMW 3 No comment
2015 Chevy Volt null 2015 MB C200 good
1997 Ford E350 Go get one now 1997 VW GTI get
需要注意的一件事是您的列名可能不明确,因为它们在数据帧中的名称相同(因此您可以更改它们的名称以使对结果数据帧的操作更容易编写)。
例如,首先我有一个这样的数据框
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
+----+-----+-----+--------------------+-----+
我们有 2012 年、1997 年和 2015 年。我们还有另一个像这样的 Dataframe
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|BMW | 3| No comment| |
|1997|VW | GTI | get | |
|2015|MB | C200| good| null|
+----+-----+-----+--------------------+-----+
我们还有 2012 年、1997 年、2015 年。我们如何将同一年的行合并在一起?谢谢
输出应该是这样的
+----+-----+-----+--------------------+-----++-----+-----+--------------------------+
|year| make|model| comment|blank|| make|model| comment|blank|
+----+-----+-----+--------------------+-----++-----+-----+-----+--------------------+
|2012|Tesla| S| No comment| |BMW | 3 | no comment|
|1997| Ford| E350|Go get one now th...| |VW |GTI | get |
|2015|Chevy| Volt| null| null|MB |C200 | Good |null
+----+-----+-----+--------------------+-----++----+-----+-----+---------------------+
您可以通过简单的 join
获得您想要的 table。类似于:
val joined = df1.join(df2, df1("year") === df2("year"))
我加载了您的输入,因此我看到了以下内容:
scala> df1.show
...
year make model comment
2012 Tesla S No comment
1997 Ford E350 Go get one now
2015 Chevy Volt null
scala> df2.show
...
year make model comment
2012 BMW 3 No comment
1997 VW GTI get
2015 MB C200 good
当我 运行 join
时,我得到:
scala> val joined = df1.join(df2, df1("year") === df2("year"))
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string]
scala> joined.show
...
year make model comment year make model comment
2012 Tesla S No comment 2012 BMW 3 No comment
2015 Chevy Volt null 2015 MB C200 good
1997 Ford E350 Go get one now 1997 VW GTI get
需要注意的一件事是您的列名可能不明确,因为它们在数据帧中的名称相同(因此您可以更改它们的名称以使对结果数据帧的操作更容易编写)。