Spark - explode and merge columns

Question

I have data that I want to process in PySpark SQL. It looks like this:

+---------+----------------+
|user_id  |user_ids        |
+---------+----------------+
|null     |[479534, 1234]  |
|null     |[1234]          |
|null     |[479535]        |
|null     |[479535, 479536]|
|null     |[1234]          |
|null     |[479535]        |
|1234567  |null            |
|1234567  |null            |
|777      |null            |
|888      |null            |
|null     |null            |
+---------+----------------+

I need a single user_id column, with the values from user_ids exploded into additional rows, like this:

+---------+
|user_id  |
+---------+
|479534   |
|1234     |
|1234     |
|479535   |
|479535   |
|479536   |
|1234     |
|479535   |
|1234567  |
|1234567  |
|777      |
|888      |
|null     |
+---------+

How can I achieve this?

I tried:

    .withColumn("user_ids", F.explode_outer("user_ids"))
    .withColumn("user_id", F.coalesce(df["user_id"], df["user_ids"]))

But I get:

cannot resolve 'coalesce(user_id, user_ids)' due to data type mismatch: input to function coalesce should all be the same type, but it's [bigint, array<bigint>];

So I assume that withColumn cannot use another newly created column in this case.

Answer 1

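The DataFrame is not saved after the explode, so df['user_ids'] still refers to the original array<bigint> column rather than the exploded one. Do not reference the columns as df['col']; just call F.col('col'), which resolves against the DataFrame at that point in the chain. For example,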

df.withColumn('user_ids', F.explode_outer('user_ids')) \
  .withColumn('user_id',  F.coalesce(F.col('user_id'), F.col('user_ids')))

Here is my test run.

from pyspark.sql import functions as f


df = spark.createDataFrame([[None, [479534, 1234]], [1234567, None]], ['user_id', 'user_ids'])
df.show()

+-------+--------------+
|user_id|      user_ids|
+-------+--------------+
|   null|[479534, 1234]|
|1234567|          null|
+-------+--------------+

df.withColumn('user_ids', f.explode_outer('user_ids')) \
  .withColumn('user_id',  f.coalesce(f.col('user_id'), f.col('user_ids'))) \
  .drop('user_ids') \
  .show()

+-------+
|user_id|
+-------+
| 479534|
|   1234|
|1234567|
+-------+
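
To make the row-level behaviour concrete, here is a plain-Python sketch (no Spark required) of what explode_outer followed by coalesce does to each row. The function name explode_outer_then_coalesce and the tuple representation of rows are assumptions made for illustration only; this models the semantics and is not how Spark implements them.

```python
def explode_outer_then_coalesce(rows):
    """Model explode_outer('user_ids') followed by coalesce(user_id, user_ids).

    Each row is a (user_id, user_ids) tuple; None stands in for SQL null.
    """
    result = []
    for user_id, user_ids in rows:
        # explode_outer: one output row per array element; a null or empty
        # array still yields a single row whose exploded value is null.
        elements = user_ids if user_ids else [None]
        for element in elements:
            # coalesce: take the first non-null value, here user_id first,
            # then the exploded element.
            result.append(user_id if user_id is not None else element)
    return result


# Sample rows taken from the table in the question.
rows = [
    (None, [479534, 1234]),
    (None, [1234]),
    (None, [479535]),
    (None, [479535, 479536]),
    (None, [1234]),
    (None, [479535]),
    (1234567, None),
    (1234567, None),
    (777, None),
    (888, None),
    (None, None),
]

for value in explode_outer_then_coalesce(rows):
    print(value)
```

The last row, where both columns are null, survives as a single null. That is the reason for using explode_outer rather than explode, which would drop rows whose array is null or empty.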
