How do I use a map or hashmap on a Spark DataFrame

Question

My DataFrame has the following schema:

    root
    |-- rowkey: string (nullable = true)
    |-- SALES: string (nullable = true)
    |-- ID: string (nullable = true)
    |-- D_Parent_MID: string (nullable = true)
    |-- D_ILK: string (nullable = true)
    |-- G_Parent_MID: string (nullable = true)

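Using this DataFrame, I want to check, for a specific ID, whether "D_Parent_MID" has a value; if it does, use/store that value. If it does not, check "G_Parent_MID" and, if that has a value, use/store it instead.

I'm not sure how to implement this.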

Answer 1

If I understand the question correctly, you can use the withColumn API, for example:

    import org.apache.spark.sql.functions._

    // Prefer D_Parent_MID, fall back to G_Parent_MID, otherwise keep the existing ID.
    val result = df.withColumn("ID",
      when(col("D_Parent_MID").isNotNull, col("D_Parent_MID"))
        .when(col("G_Parent_MID").isNotNull, col("G_Parent_MID"))
        .otherwise(col("ID")))
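
To try the fallback logic end to end, here is a minimal, self-contained sketch. Several things in it are assumptions rather than part of the question: the sample rows are made up, the result goes into a new, hypothetical column named `Parent_MID` instead of overwriting `ID`, and the object and method names are illustrative. It also shows `coalesce`, which returns the first non-null argument and should give the same result as the `when`/`otherwise` chain here. The last step collects the result into a driver-side `Map`, which is only reasonable if the data is small enough to fit in driver memory.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, when}

object ParentMidSketch {

  // The when/otherwise chain from the answer, written to a new column
  // ("Parent_MID" is a hypothetical name) so the original ID is preserved.
  def withParentMid(df: DataFrame): DataFrame =
    df.withColumn("Parent_MID",
      when(col("D_Parent_MID").isNotNull, col("D_Parent_MID"))
        .when(col("G_Parent_MID").isNotNull, col("G_Parent_MID"))
        .otherwise(col("ID")))

  // Alternative: coalesce picks the first non-null value among its arguments.
  def withParentMidCoalesce(df: DataFrame): DataFrame =
    df.withColumn("Parent_MID",
      coalesce(col("D_Parent_MID"), col("G_Parent_MID"), col("ID")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parent-mid-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; None becomes a SQL NULL in the DataFrame.
    val df = Seq(
      ("r1", "100", "id1", Option("d1"), "x", Option("g1")),
      ("r2", "200", "id2", Option.empty[String], "y", Option("g2")),
      ("r3", "300", "id3", Option.empty[String], "z", Option.empty[String])
    ).toDF("rowkey", "SALES", "ID", "D_Parent_MID", "D_ILK", "G_Parent_MID")

    val resolved = withParentMid(df)

    // Check one specific ID.
    resolved.filter(col("ID") === "id2").select("ID", "Parent_MID").show()

    // Assumption: the data is small, so it can be collected into a local Map.
    val parentById: Map[String, String] =
      resolved.select("ID", "Parent_MID").as[(String, String)].collect().toMap
    println(parentById.get("id2"))

    spark.stop()
  }
}
```

If the DataFrame is large, it is usually better to keep the lookup inside Spark (a `filter` or a join on `ID`) rather than collecting everything to the driver.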

Hope this helps!

apache-spark-sql

apache-spark-2.0