Join two DataFrames and replace the original column values using Spark Scala

Question

I have two DataFrames:

df1:

+---+-----+--------+
|key|price|    date|
+---+-----+--------+
|  1|  1.0|20210101|
|  2|  2.0|20210101|
|  3|  3.0|20210101|
+---+-----+--------+

df2:

+---+-----+
|key|price|
+---+-----+
|  1|  1.1|
|  2|  2.2|
|  3|  3.3|
+---+-----+

I want to replace the price values in df1 with the price values from df2 wherever df1.key == df2.key.

Expected output:

+---+-----+--------+
|key|price|    date|
+---+-----+--------+
|  1|  1.1|20210101|
|  2|  2.2|20210101|
|  3|  3.3|20210101|
+---+-----+--------+

I found some solutions in Python, but I could not find a working solution in Scala.

Answer 1

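Just join on key and drop df1's price column:

// Inner join on key, drop df1's price so only df2's remains, then restore the column order
val df = df1.join(df2, Seq("key")).drop(df1("price")).select("key", "price", "date")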

df.show
//+---+-----+--------+
//|key|price|    date|
//+---+-----+--------+
//|  1|  1.1|20210101|
//|  2|  2.2|20210101|
//|  3|  3.3|20210101|
//+---+-----+--------+
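
To try the join-and-drop approach outside of spark-shell, a minimal self-contained sketch might look like the following. The object name `ReplacePriceInnerJoin`, the helper `replacePrice`, the local `SparkSession` settings, and the choice of `String` for the date column are all assumptions made for illustration; they are not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReplacePriceInnerJoin {

  // Hypothetical helper: takes price from df2 for every key present in both frames.
  // Keys that exist only in df1 are dropped, because this is an inner join.
  def replacePrice(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, Seq("key"))
      .drop(df1("price"))             // remove the original price, keep df2's
      .select("key", "price", "date") // restore the original column order

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-price-inner-join")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question; date is modelled as a String here (assumption)
    val df1 = Seq((1, 1.0, "20210101"), (2, 2.0, "20210101"), (3, 3.0, "20210101"))
      .toDF("key", "price", "date")
    val df2 = Seq((1, 1.1), (2, 2.2), (3, 3.3)).toDF("key", "price")

    replacePrice(df1, df2).orderBy("key").show()

    spark.stop()
  }
}
```

Note that with an inner join, any key missing from df2 disappears from the result, so this version only fits the case where every key in df1 has a match.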

Or, if df1 has more entries and you want to keep their price when there is no match in df2, use a left join with a coalesce expression:

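import org.apache.spark.sql.functions.{coalesce, col}

// Left join keeps every df1 row; coalesce falls back to df1's price when df2 has no match
val df = df1.join(df2, Seq("key"), "left").select(
  col("key"),
  col("date"),
  coalesce(df2("price"), df1("price")).as("price")
)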
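
The difference between the two approaches only shows up when df1 has a key that df2 lacks. The sketch below adds a hypothetical fourth row (key 4, price 4.0) to df1 to illustrate this. The object name `ReplacePriceLeftJoin`, the helper `replacePriceKeepUnmatched`, the extra row, and the local session settings are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object ReplacePriceLeftJoin {

  // Hypothetical helper: prefer df2's price, fall back to df1's price
  // for keys that have no match in df2.
  def replacePriceKeepUnmatched(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, Seq("key"), "left").select(
      col("key"),
      col("date"),
      coalesce(df2("price"), df1("price")).as("price")
    )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-price-left-join")
      .getOrCreate()
    import spark.implicits._

    // Key 4 is an illustrative extra row with no counterpart in df2
    val df1 = Seq(
      (1, 1.0, "20210101"),
      (2, 2.0, "20210101"),
      (3, 3.0, "20210101"),
      (4, 4.0, "20210101")
    ).toDF("key", "price", "date")
    val df2 = Seq((1, 1.1), (2, 2.2), (3, 3.3)).toDF("key", "price")

    // Keys 1 to 3 take df2's price; key 4 keeps its original 4.0
    replacePriceKeepUnmatched(df1, df2).orderBy("key").show()

    spark.stop()
  }
}
```

If a different column order is wanted, for example key, price, date as in the question, the three expressions inside `select` can simply be reordered.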

scala

apache-spark

apache-spark-sql