Split array struct to single value column Spark scala
I have a dataframe with a single array-of-struct column, and I want to split out the nested values and add them as new comma-separated string columns.

Sample dataframe:

tests
{id:1,name:foo},{id:2,name:bar}

Expected result dataframe:

tests                           tests_id  tests_name
[id:1,name:foo],[id:2,name:bar] 1, 2      foo, bar
I tried the code below but got an error:
df.withColumn("tests_name", concat_ws(",", explode(col("tests.name"))))
Error:
org.apache.spark.sql.AnalysisException: Generators are not supported when it's nested in expressions, but got: concat_ws(,, explode(tests.name AS `name`));
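The exception is raised because generator functions such as explode may only appear at the top level of a projection, never nested inside another expression like concat_ws. A minimal sketch of the constraint, with explode placed directly in a select (the sample data here just mirrors the question's tests column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Reproduce the question's array-of-struct column.
case class Test(id: Long, name: String)
val df = Seq(Seq(Test(1L, "foo"), Test(2L, "bar"))).toDF("tests")

// explode is legal here because it is the top-level expression in select;
// it produces one row per struct element rather than a joined string.
val flattened = df
  .select(col("tests"), explode(col("tests")).as("t"))
  .select(col("tests"), col("t.id"), col("t.name"))
```

This flattening produces one row per array element, which is why the answer below reaches for transform (or a UDF) instead: those operate inside the array and keep one row per input row.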
It depends on which Spark version you are using. Assuming the dataframe schema is as follows:
root
|-- test: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
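For reproducing the snippets below, here is a sketch of building a dataframe with exactly this schema; the sample rows are taken from the output table at the end of the answer:

```scala
import org.apache.spark.sql.SparkSession

// Case class fields become the struct fields id (long) and name (string).
case class Element(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  Seq(Element(1L, "foo"), Element(2L, "bar")),
  Seq(Element(3L, "foo"), Element(4L, "bar"))
).toDF("test")

// Prints the root / array / struct schema shown above.
df.printSchema()
```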
Spark 3.0.0
df.withColumn("id", concat_ws(",", transform($"test", x => x.getField("id"))))
.withColumn("name", concat_ws(",", transform($"test", x => x.getField("name"))))
.show(false)
Spark 2.4.0+
df.withColumn("id", concat_ws(",", expr("transform(test, x -> x.id)")))
.withColumn("name", concat_ws(",", expr("transform(test, x -> x.name)")))
.show(false)
Spark < 2.4
val extract_id = udf((test: Seq[Row]) => test.map(_.getAs[Long]("id")))
val extract_name = udf((test: Seq[Row]) => test.map(_.getAs[String]("name")))
df.withColumn("id", concat_ws(",", extract_id($"test")))
.withColumn("name", concat_ws(",", extract_name($"test")))
.show(false)
Output:
+--------------------+---+-------+
|test |id |name |
+--------------------+---+-------+
|[[1, foo], [2, bar]]|1,2|foo,bar|
|[[3, foo], [4, bar]]|3,4|foo,bar|
+--------------------+---+-------+