Pyspark sample value from array column

Question

My Spark DataFrame looks like this:

target_id   other_ids
3733345     [3731634, 3729995, 3728014, 3708332, 3720...
3725312     [3711541, 3726052, 3733763, 900056057, 371...
3717114     [3701718, 3713481, 3715433, 3714825, 3731...
3408996     [3405896, 3250400, 3237054, 3242492, 3256...
3354970     [3354969, 3347893, 3348168, 3353273, 3356...

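I want to first shuffle the elements of the array in the other_ids column, then create a new column new_id by sampling one id from that array, where the sampled id must not equal target_id.
The desired result: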

target_id   other_ids                                      new_id
3733345     [3731634, 3729995, 3728014, 3708332, 3720...   3708332
3725312     [3711541, 3726052, 3733763, 900056057, 371...  900056057
3717114     [3701718, 3713481, 3715433, 3714825, 3731...   3250400
3408996     [3405896, 3250400, 3237054, 3242492, 3256...   3237054
3354970     [3354969, 3347893, 3348168, 3353273, 3356...   3353273

Any suggestions? Thanks.

Answer 1

You can try this:

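from pyspark.sql import functions as F

# Drop target_id from other_ids, shuffle what remains, and take the first element.
df = df.withColumn('new_id', F.element_at(
    F.shuffle(
        F.array_except(F.col('other_ids'), F.array(F.col('target_id')))
    ),
    1  # element_at uses 1-based indexing for arrays
))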

apache-spark

apache-spark-sql

pyspark