Changing a double array field to a single array in Hive or PySpark
I have a field interest_product_id that looks like this:
a.select('cust_id', 'interest_product_id').show(1,False)
+---------------+----------------------------------------------+
|cust_id |interest_product_id |
+---------------+----------------------------------------------+
|4308c3w994 |[[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72]]|
+---------------+----------------------------------------------+
The schema is as follows:
root
|-- cust_id: string (nullable = true)
|-- interest_product_id: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Because interest_product_id is an array whose elements are themselves array(string), the field is displayed as [[**]]. How can I convert it to array(string)?
Expected result:
+---------------+----------------------------------------------+
|cust_id |interest_product_id |
+---------------+----------------------------------------------+
|4308c3w994 |[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72] |
+---------------+----------------------------------------------+
Please suggest the best approach. Thanks!
Use flatten (available since Spark 2.4), which creates a flat array from an array of arrays.
from pyspark.sql import functions as F
df = spark.createDataFrame([("4308c3w994", [["73ndy0-885bns-ysrd", "isgbf-6322-734f4-92j72"]], )], ("cust_id", "interest_product_id", ))
df.withColumn("interest_product_id", F.flatten(F.col("interest_product_id"))).show(truncate=False)
Output:
+----------+--------------------------------------------+
|cust_id |interest_product_id |
+----------+--------------------------------------------+
|4308c3w994|[73ndy0-885bns-ysrd, isgbf-6322-734f4-92j72]|
+----------+--------------------------------------------+
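To make it clearer what flatten does to each row, here is a rough plain-Python model of the operation. It is only an illustration, not Spark's implementation: the helper name flatten_one_level is made up for this sketch, and the null handling reflects my understanding of Spark's documented behavior (a null inner array makes the whole result null, and only one level of nesting is removed).

```python
from typing import List, Optional


def flatten_one_level(
    nested: Optional[List[Optional[List[str]]]],
) -> Optional[List[str]]:
    """Hypothetical plain-Python model of flattening an array<array<string>> value."""
    # A null outer array stays null.
    if nested is None:
        return None
    # Assumption: as with Spark's flatten, any null inner array
    # makes the whole result null.
    if any(inner is None for inner in nested):
        return None
    # Concatenate the inner arrays in order, removing exactly one nesting level.
    return [item for inner in nested for item in inner]


# The value from the question's single row.
row_value = [["73ndy0-885bns-ysrd", "isgbf-6322-734f4-92j72"]]
print(flatten_one_level(row_value))
```

If a row held several inner arrays, their elements would be concatenated in order into one array, which is worth keeping in mind if you expected one output row per inner array instead.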