How to dynamically join two dataframes and divide an unknown number of columns by the same-named columns in the other dataframe
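The key point is that the number of columns col1, col2, ... is unknown and may run into the hundreds, so the join and the division have to be done dynamically.
How can this be done in Spark with Scala?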
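A Dataset object has a columns property that gives you an array of its column names. It is easy to filter the columns of df1, keeping only those that also exist in df2, and then use map to derive the required columns: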
// Needed for toDF outside spark-shell (assumes a SparkSession named spark)
import spark.implicits._

val df1 = Seq(("xyz", 10.0, 12.0),
              ("abc", 42.0, 7.0)).toDF("join_col", "col1", "col2")
val df2 = Seq(("xyz", 7.0, 22.0, 11.0),
              ("abc", 11.0, 9.0, 42.0)).toDF("join_col", "col1", "col2", "col3")

// Columns present in both datasets, in df1's order
val cols = df1.columns.filter(df2.columns.toSet)
// The first common column is taken as the join key
val join_col = cols(0)
val joined = df1.join(df2, df1(join_col) === df2(join_col))

// Columns from df1
val df1Cols = cols.map(df1(_))
// Division columns, renamed to div_<name>
val divCols = cols.drop(1).map(name => (df1(name) / df2(name)) as s"div_${name}")
val finalTable = joined.select((df1Cols ++ divCols): _*)
finalTable.show(false)
// +--------+----+----+------------------+------------------+
// |join_col|col1|col2|div_col1          |div_col2          |
// +--------+----+----+------------------+------------------+
// |xyz     |10.0|12.0|1.4285714285714286|0.5454545454545454|
// |abc     |42.0|7.0 |3.8181818181818183|0.7777777777777778|
// +--------+----+----+------------------+------------------+
This assumes that the join column is the first column in df1.
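If the join column is not guaranteed to come first, one option is to pass its name explicitly. The following is only a sketch of that variant, not part of the original answer: the object DivideCommonColumns and the helpers commonValueColumns and divideCommon are hypothetical names, and it assumes column names contain no dots, since the columns are referenced through the aliases l and r.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DivideCommonColumns {
  // Hypothetical helper: names present in both frames, in the left frame's
  // order, excluding the join key.
  def commonValueColumns(left: Seq[String], right: Seq[String], joinCol: String): Seq[String] = {
    val rightSet = right.toSet
    left.filter(name => name != joinCol && rightSet.contains(name))
  }

  // Hypothetical helper: joins on joinCol and appends one div_<name> column
  // per common value column (left value divided by right value).
  def divideCommon(left: DataFrame, right: DataFrame, joinCol: String): DataFrame = {
    val valueCols = commonValueColumns(left.columns.toSeq, right.columns.toSeq, joinCol)
    // Aliases keep same-named columns from the two sides unambiguous.
    val l = left.alias("l")
    val r = right.alias("r")
    val kept = (joinCol +: valueCols).map(name => col(s"l.$name"))
    val divided = valueCols.map(name => (col(s"l.$name") / col(s"r.$name")).as(s"div_$name"))
    l.join(r, col(s"l.$joinCol") === col(s"r.$joinCol"))
      .select((kept ++ divided): _*)
  }
}
```

With the df1 and df2 defined above, DivideCommonColumns.divideCommon(df1, df2, "join_col").show(false) should select the same columns as finalTable, without depending on where join_col sits in df1.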