Split & Map fields of array<string> in pyspark
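I have a PySpark DataFrame with 7 columns, shown below. Six of the columns are array<string> and one (country) is array<array<string>>.
The sample data is as follows: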
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|customer_id |equipment_id |type |language |country |lang_cnt_str |model_num |
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|[18e644bb-4342-4c22-ab9b-a90fda50ad69, 70f0b998-3e4e-422d-b863-1f5f455c4883, 54a99992-5403-4946-b059-f71ec7ef2cca]|[1407c4a9-b075-4837-bada-690da10717cd, fc4632f3-302b-43cb-9245-ede2d1ac590f, 1407c4a9-b075-4837-bada-690da10717cd]|[comm, comm, vspec]|[cs, en-GB, pt-PT] |[[CZ], [PT], [PT]] |[(language = 'cs' AND country IS IN ('CZ')), (language = 'en-GB' AND country IS IN ('PT')), (language = 'pt-PT' AND country IS IN ('PT'))]|[1618832612617, 1618832612858, 1618832614027]|
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
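I want to split every array and map the elements by position, so that the i-th elements of all columns end up together in one row. The expected output is below.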
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|customer_id |equipment_id |type |language |country |lang_cnt_str |model_num |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|18e644bb-4342-4c22-ab9b-a90fda50ad69 |1407c4a9-b075-4837-bada-690da10717cd |comm |cs |[CZ] |(language = 'cs' AND country IS IN ('CZ')) |1618832612617 |
|70f0b998-3e4e-422d-b863-1f5f455c4883 |fc4632f3-302b-43cb-9245-ede2d1ac590f |comm |en-GB |[PT] |(language = 'en-GB' AND country IS IN ('PT')) |1618832612858 |
|54a99992-5403-4946-b059-f71ec7ef2cca |1407c4a9-b075-4837-bada-690da10717cd |vspec |pt-PT |[PT] |(language = 'pt-PT' AND country IS IN ('PT')) |1618832614027 |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
How can we achieve this in PySpark? Can someone help me? Thanks in advance!
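We exchanged some comments above, and I don't think there is anything special about the array(array(string)) column: it can be zipped and exploded like the other array columns. So I am posting this answer to show the previously posted solution, edited for this case: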
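from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# One row: an array<string> column and an array<array<string>> column
df = spark.createDataFrame([
    (['1', '2', '3'], [['1'], ['2'], ['3']])
], ['col1', 'col2'])

df = (df
    # pair up the i-th elements of both arrays into an array of structs
    .withColumn('zipped', f.arrays_zip(f.col('col1'), f.col('col2')))
    # emit one row per struct
    .withColumn('unzipped', f.explode(f.col('zipped')))
    # pull the struct fields back out as top-level columns
    .select(f.col('unzipped.col1'),
            f.col('unzipped.col2'))
)
df.show()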
The input is:
+---------+---------------+
| col1| col2|
+---------+---------------+
|[1, 2, 3]|[[1], [2], [3]]|
+---------+---------------+
The output is:
+----+----+
|col1|col2|
+----+----+
| 1| [1]|
| 2| [2]|
| 3| [3]|
+----+----+