pyspark 将行分成多行
pyspark break row to multiple rows
我正在尝试在 PYSPARK 中完成以下任务。下面提供了示例源。我们将在源中拥有更多记录。
来源:
预期输出:
您可以使用stack
函数:
示例数据的设置:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(COLA='H', COLB='I', COLC='J',
COL_GRP_A_1=0.1, COL_GRP_A_2=1., COL_GRP_A_3=3.,
COL_GRP_B_1=4., COL_GRP_B_2=2.5, COL_GRP_B_3=6.,
COL_GRP_C_1=2., COL_GRP_C_2=5., COL_GRP_C_3=4.,
),
])
df.show()
# Output
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|COLA|COLB|COLC|COL_GRP_A_1|COL_GRP_A_2|COL_GRP_A_3|COL_GRP_B_1|COL_GRP_B_2|COL_GRP_B_3|COL_GRP_C_1|COL_GRP_C_2|COL_GRP_C_3|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| H| I| J| 0.1| 1.0| 3.0| 4.0| 2.5| 6.0| 2.0| 5.0| 4.0|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
现在正在处理:
(
df
.selectExpr(
'COLA', 'COLB', 'COLC',
'stack(3, "COL_GRP_A", COL_GRP_A_1, COL_GRP_A_2, COL_GRP_A_3, "COL_GRP_B", COL_GRP_B_1, COL_GRP_B_2, COL_GRP_B_3, "COL_GRP_C", COL_GRP_C_1, COL_GRP_C_2, COL_GRP_C_3) AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'
)
.show()
)
# Output:
+----+----+----+---------+--------+--------+--------+
|COLA|COLB|COLC| GRP|COL_VAL1|COL_VAL2|COL_VAL3|
+----+----+----+---------+--------+--------+--------+
| H| I| J|COL_GRP_A| 0.1| 1.0| 3.0|
| H| I| J|COL_GRP_B| 4.0| 2.5| 6.0|
| H| I| J|COL_GRP_C| 2.0| 5.0| 4.0|
+----+----+----+---------+--------+--------+--------+
我正在尝试在 PYSPARK 中完成以下任务。下面提供了示例源。我们将在源中拥有更多记录。
来源:
预期输出:
您可以使用stack
函数:
示例数据的设置:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(COLA='H', COLB='I', COLC='J',
COL_GRP_A_1=0.1, COL_GRP_A_2=1., COL_GRP_A_3=3.,
COL_GRP_B_1=4., COL_GRP_B_2=2.5, COL_GRP_B_3=6.,
COL_GRP_C_1=2., COL_GRP_C_2=5., COL_GRP_C_3=4.,
),
])
df.show()
# Output
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|COLA|COLB|COLC|COL_GRP_A_1|COL_GRP_A_2|COL_GRP_A_3|COL_GRP_B_1|COL_GRP_B_2|COL_GRP_B_3|COL_GRP_C_1|COL_GRP_C_2|COL_GRP_C_3|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| H| I| J| 0.1| 1.0| 3.0| 4.0| 2.5| 6.0| 2.0| 5.0| 4.0|
+----+----+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
现在正在处理:
(
df
.selectExpr(
'COLA', 'COLB', 'COLC',
'stack(3, "COL_GRP_A", COL_GRP_A_1, COL_GRP_A_2, COL_GRP_A_3, "COL_GRP_B", COL_GRP_B_1, COL_GRP_B_2, COL_GRP_B_3, "COL_GRP_C", COL_GRP_C_1, COL_GRP_C_2, COL_GRP_C_3) AS (GRP, COL_VAL1, COL_VAL2, COL_VAL3)'
)
.show()
)
# Output:
+----+----+----+---------+--------+--------+--------+
|COLA|COLB|COLC| GRP|COL_VAL1|COL_VAL2|COL_VAL3|
+----+----+----+---------+--------+--------+--------+
| H| I| J|COL_GRP_A| 0.1| 1.0| 3.0|
| H| I| J|COL_GRP_B| 4.0| 2.5| 6.0|
| H| I| J|COL_GRP_C| 2.0| 5.0| 4.0|
+----+----+----+---------+--------+--------+--------+