Split & Map fields of array<string> in pyspark

Question

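I have a PySpark DataFrame with 7 columns, as shown below. Six of the columns are flat arrays, and one (country) is an array of arrays.

Sample data is shown below: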


+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|customer_id                                                                                                       |equipment_id                                                                                                       |type              |language            |country            |lang_cnt_str                                                                                                                              |model_num                                    |
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|[18e644bb-4342-4c22-ab9b-a90fda50ad69, 70f0b998-3e4e-422d-b863-1f5f455c4883, 54a99992-5403-4946-b059-f71ec7ef2cca]|[1407c4a9-b075-4837-bada-690da10717cd, fc4632f3-302b-43cb-9245-ede2d1ac590f, 1407c4a9-b075-4837-bada-690da10717cd]|[comm, comm, vspec]|[cs, en-GB, pt-PT]  |[[CZ], [PT], [PT]] |[(language = 'cs' AND country IS IN ('CZ')), (language = 'en-GB' AND country IS IN ('PT')), (language = 'pt-PT' AND country IS IN ('PT'))]|[1618832612617, 1618832612858, 1618832614027]|
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+

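I want to split every array and map the elements of all columns by position, so that the i-th elements of each column end up in the same row. The expected output is below.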


+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|customer_id                            |equipment_id                           |type   |language   |country     |lang_cnt_str                                      |model_num          |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|18e644bb-4342-4c22-ab9b-a90fda50ad69   |1407c4a9-b075-4837-bada-690da10717cd   |comm   |cs         |[CZ]        |(language = 'cs' AND country IS IN ('CZ'))        |1618832612617      |
|70f0b998-3e4e-422d-b863-1f5f455c4883   |fc4632f3-302b-43cb-9245-ede2d1ac590f   |comm   |en-GB      |[PT]        |(language = 'en-GB' AND country IS IN ('PT'))     |1618832612858      |
|54a99992-5403-4946-b059-f71ec7ef2cca   |1407c4a9-b075-4837-bada-690da10717cd   |vspec  |pt-PT      |[PT]        |(language = 'pt-PT' AND country IS IN ('PT'))     |1618832614027      |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+

How can we achieve this in PySpark? Can someone help me? Thanks in advance!

Answer 1

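We exchanged a few comments above, and I don't think there is anything special about the array(array(string)) column. So I am posting this answer to show the previously posted solution applied to such a column: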

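from pyspark.sql import SparkSession, functions as f

# Reuse the active session, or start one when running as a standalone script
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
  (['1', '2', '3'], [['1'], ['2'], ['3']])
], ['col1', 'col2'])

df.show()  # the input

# Zip the arrays position by position into one array of structs,
# then explode that array into one row per struct
df = (df
      .withColumn('zipped', f.arrays_zip(f.col('col1'), f.col('col2')))
      .withColumn('unzipped', f.explode(f.col('zipped')))
      .select(f.col('unzipped.col1'),
              f.col('unzipped.col2')
             )
     )

df.show()  # the output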

The input is:

+---------+---------------+
|     col1|           col2|
+---------+---------------+
|[1, 2, 3]|[[1], [2], [3]]|
+---------+---------------+

The output is:

+----+----+
|col1|col2|
+----+----+
|   1| [1]|
|   2| [2]|
|   3| [3]|
+----+----+
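
To make the semantics of the zip-then-explode step concrete, here is a minimal plain-Python sketch of what it does to a single row. It is a model, not Spark code: the helper name `zip_and_explode` is hypothetical, only four of the question's seven columns are used, and the element types are assumed because the question does not show the schema.

```python
from itertools import zip_longest


def zip_and_explode(row, columns):
    """Plain-Python model of explode(arrays_zip(...)) for one row.

    `row` maps column names to lists. The result holds one dict per
    array position. Shorter arrays are padded with None, mirroring the
    nulls that arrays_zip yields for missing positions.
    """
    arrays = [row[name] for name in columns]
    return [dict(zip(columns, values)) for values in zip_longest(*arrays)]


# Values taken from the question's sample row (subset of columns;
# model_num is assumed to hold strings).
columns = ["type", "language", "country", "model_num"]
row = {
    "type": ["comm", "comm", "vspec"],
    "language": ["cs", "en-GB", "pt-PT"],
    "country": [["CZ"], ["PT"], ["PT"]],
    "model_num": ["1618832612617", "1618832612858", "1618832614027"],
}

exploded = zip_and_explode(row, columns)
for record in exploded:
    print(record)
```

The nested `country` values pass through unchanged, which is why the array-of-arrays column needs no special handling. For the question's DataFrame, the equivalent PySpark call would presumably be `f.explode(f.arrays_zip(*cols))` over all seven column names, followed by selecting the struct's fields as in the two-column example above.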

Tags: python, apache-spark, pyspark, apache-spark-sql