PySpark: Convert String to Array of String for a column
I have a DataFrame like this:
data = [('ID1', "[apples, mangos, eggs, milk, oranges]"),
        ('ID1', "[eggs, milk, cereals, mangos, apples]")]
df = spark.createDataFrame(data, ['ID', "colval"])
df.show(truncate=False)
df.printSchema()
+---+-------------------------------------+
|ID |colval                               |
+---+-------------------------------------+
|ID1|[apples, mangos, eggs, milk, oranges]|
|ID1|[eggs, milk, cereals, mangos, apples]|
+---+-------------------------------------+
root
|-- ID: string (nullable = true)
|-- colval: string (nullable = true)
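I want to convert colval to an array of strings. When I take the first element after splitting, it returns the same value as the original string. Any help? The schema I want is: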
root
|-- ID: string (nullable = true)
|-- colval: array (nullable = true)
|    |-- element: string (containsNull = true)
I tried using split, but ended up with this result:
from pyspark.sql.functions import split
df = df.withColumn('colval', split('colval', "', ?'"))
df.show(truncate = False)
df.printSchema()
+---+---------------------------------------+
|ID |colval                                 |
+---+---------------------------------------+
|ID1|[[apples, mangos, eggs, milk, oranges]]|
|ID1|[[eggs, milk, cereals, mangos, apples]]|
+---+---------------------------------------+
root
|-- ID: string (nullable = true)
|-- colval: array (nullable = true)
|    |-- element: string (containsNull = true)
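You can remove the [ and ] characters and then split on the comma:
import pyspark.sql.functions as F
df.withColumn("colval", F.split(F.regexp_replace("colval", r"\[|\]", ""), ",")).show(truncate=False)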
+---+-----------------------------------------+
|ID |colval                                   |
+---+-----------------------------------------+
|ID1|[apples,  mangos,  eggs,  milk,  oranges]|
|ID1|[eggs,  milk,  cereals,  mangos,  apples]|
+---+-----------------------------------------+
root
|-- ID: string (nullable = true)
|-- colval: array (nullable = true)
|    |-- element: string (containsNull = true)
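To see why the split in the question left the string in one piece, the two patterns can be compared in plain Python. This is only a sketch: Python's re module stands in for the Java regex engine that Spark uses (the two should agree on patterns this simple), and the variable names are illustrative rather than taken from the thread.

```python
import re

raw = "[apples, mangos, eggs, milk, oranges]"

# The pattern from the question expects a quote character on both sides of
# the comma. The string contains no quotes, so nothing matches and the
# result is a single-element list holding the whole string.
attempt = re.split(r"', ?'", raw)

# Removing the brackets first and splitting on a bare comma does split,
# but every element after the first keeps its leading space.
stripped = re.sub(r"\[|\]", "", raw)
parts = stripped.split(",")

print(attempt)
print(parts)
```

The single-element result is what shows up as the double brackets in the question's output: an array whose only element is the original bracketed string.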
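If you also want to trim the elements after splitting, you can apply a higher-order function (Spark 2.4+) to the resulting array: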
(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
.withColumn("colval",F.expr("transform(colval,x-> trim(x))")))
Verifying both approaches and the difference between them (note the extra leading spaces in the first result):
df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),",")).collect()
[Row(ID='ID1', colval=['apples', ' mangos', ' eggs', ' milk', ' oranges']),
Row(ID='ID1', colval=['eggs', ' milk', ' cereals', ' mangos', ' apples'])]
(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
.withColumn("colval",F.expr("transform(colval,x-> trim(x))"))).collect()
[Row(ID='ID1', colval=['apples', 'mangos', 'eggs', 'milk', 'oranges']),
Row(ID='ID1', colval=['eggs', 'milk', 'cereals', 'mangos', 'apples'])]
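A possible variation, not part of the answer above, is to let the split pattern consume the whitespace after each comma so that no separate trim pass is needed. The sketch below shows the idea in plain Python; parse_bracketed_list is a hypothetical helper name, and it assumes elements never contain commas themselves. Since F.split takes a regular expression, the same pattern (r",\s*") could presumably be used there in place of ",".

```python
import re
from typing import List


def parse_bracketed_list(raw: str) -> List[str]:
    """Hypothetical helper: turn "[a, b, c]" into ["a", "b", "c"]."""
    # Drop the surrounding brackets, as in the answer above.
    inner = re.sub(r"\[|\]", "", raw).strip()
    # Split on a comma plus any whitespace that follows it, so the
    # elements come out already trimmed on the left.
    return re.split(r",\s*", inner)


print(parse_bracketed_list("[apples, mangos, eggs, milk, oranges]"))
print(parse_bracketed_list("[eggs, milk, cereals, mangos, apples]"))
```

Note that an empty input such as "[]" yields a list containing one empty string rather than an empty list, so that case would need its own handling.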