Explode multiple columns of same type with different lengths
I have a Spark dataframe in the following format that I need to explode. I have checked other solutions; however, in my case before and after can be arrays of different lengths.
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
For example, if the dataframe has a single row where before is an array of size 2 and after is an array of size 3, the exploded version should have 5 rows with the following schema:
root
|-- id: string (nullable = true)
|-- type: string (nullable = true)
 |-- start_time: string (nullable = true)
|-- end_time: string (nullable = true)
|-- area: string (nullable = true)
where type is a new column whose value is either "before" or "after".
I can do this with two separate explodes, creating the type column in each one, and then a union:
val dfSummary1 = df
  .withColumn("before_exp", explode($"before"))
  .withColumn("type", lit("before"))
  .withColumn("start_time", $"before_exp.start_time")
  .withColumn("end_time", $"before_exp.end_time")
  .withColumn("area", $"before_exp.area")
  .drop("before_exp", "before")

val dfSummary2 = df
  .withColumn("after_exp", explode($"after"))
  .withColumn("type", lit("after"))
  .withColumn("start_time", $"after_exp.start_time")
  .withColumn("end_time", $"after_exp.end_time")
  .withColumn("area", $"after_exp.area")
  .drop("after_exp", "after")

val dfResult = dfSummary1.unionAll(dfSummary2)
However, I am wondering whether there is a more elegant way to do this. Thanks.
I think exploding the two columns separately, followed by a union, is a decent, straightforward approach. You can simplify the selection of the StructField elements a bit and create a simple method for the repetitive explode process, like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
// Explode the given array column, tag each row with the source column name,
// and flatten the exploded struct's fields into top-level columns.
def explodeCol(df: DataFrame, colName: String): DataFrame = {
  val expColName = colName + "_exp"
  df.
    withColumn("type", lit(colName)).
    withColumn(expColName, explode(col(colName))).
    select("id", "type", expColName + ".*")
}
val dfResult = explodeCol(df, "before") union explodeCol(df, "after")
dfResult.show
// +---+------+----------+--------+----+
// | id| type|start_time|end_time|area|
// +---+------+----------+--------+----+
// | 1|before| 01:00| 01:30| 10|
// | 1|before| 02:00| 02:30| 20|
// | 1| after| 07:00| 07:30| 70|
// | 1| after| 08:00| 08:30| 80|
// | 1| after| 09:00| 09:30| 90|
// +---+------+----------+--------+----+
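Not part of the original answer, but if there were more than two such array columns with the same element type, the same helper could be folded over a list of column names; a minimal sketch, reusing explodeCol and df from above:

val arrayCols = Seq("before", "after")   // hypothetical list of array columns to explode
val dfAllExploded = arrayCols
  .map(explodeCol(df, _))                // one exploded frame per column
  .reduce(_ union _)                     // stack them; schemas match by construction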
You can also achieve this without a union. With the data:
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
you can do:
df
  .select($"id",
    explode(
      array(  // pack both array columns into a single array of (type, data) structs
        struct(lit("before").as("type"), $"before".as("data")),
        struct(lit("after").as("type"), $"after".as("data"))
      )
    ).as("step1")
  )
  // the second explode flattens each inner array into one row per element
  .select($"id", $"step1.type", explode($"step1.data").as("step2"))
  .select($"id", $"type", $"step2.*")
  .show()
+---+------+----------+--------+----+
| id| type|start_time|end_time|area|
+---+------+----------+--------+----+
| 1|before| 01:00| 01:30| 10|
| 1|before| 02:00| 02:30| 20|
| 1| after| 07:00| 07:30| 70|
| 1| after| 08:00| 08:30| 80|
| 1| after| 09:00| 09:30| 90|
+---+------+----------+--------+----+
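One caveat not mentioned in either answer: explode drops rows whose array is null or empty, so an id whose after array is empty would produce no "after" rows at all. If such rows should be kept, explode_outer (available since Spark 2.2) can be substituted for the inner explode; a minimal sketch under that assumption:

import org.apache.spark.sql.functions.explode_outer

df
  .select($"id",
    explode(
      array(
        struct(lit("before").as("type"), $"before".as("data")),
        struct(lit("after").as("type"), $"after".as("data"))
      )
    ).as("step1")
  )
  // explode_outer keeps a row with null struct fields when step1.data is null or empty
  .select($"id", $"step1.type", explode_outer($"step1.data").as("step2"))
  .select($"id", $"type", $"step2.*")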