Replace parentheses in PySpark array elements with regexp_replace
+---+------------+
| A| B|
+---+------------+
| x1| [s1]|
| x2| [s2 (A2)]|
| x3| [s3 (A3)]|
| x4| [s4 (A4)]|
| x5| [s5 (A5)]|
| x6| [s6 (A6)]|
+---+------------+
Desired result:
+---+------------+-------+
|A |B |value |
+---+------------+-------+
|x1 |[s1] |[s1] |
|x2 |[s2 (A2)] |[s2] |
|x3 |[s3 (A3)] |[s3] |
|x4 |[s4 (A4)] |[s4] |
|x5 |[s5 (A5)] |[s5] |
|x6 |[s6 (A6)] |[s6] |
+---+------------+-------+
When I apply the code below, the parentheses and the space before them are not replaced:
from pyspark.sql.functions import expr
df.withColumn("C",
expr('''transform(B, x-> regexp_replace(x, ' \(A.\)', ''))''')).show(truncate=False)
Result obtained:
+---+------------+------------+
|A  |B           |C           |
+---+------------+------------+
|x1 |[s1] |[s1] |
|x2 |[s2 (A2)] |[s2 ()] |
|x3 |[s3 (A3)] |[s3 ()] |
|x4 |[s4 (A4)] |[s4 ()] |
|x5 |[s5 (A5)] |[s5 ()] |
|x6 |[s6 (A6)] |[s6 ()] |
+---+------------+------------+
You can split the array values and take just the first element, or use the regexp_replace function.
Example:
df.show()
#+---+---------+
#| A| B|
#+---+---------+
#| x1| [s1]|
#| x2|[s2 (A2)]|
#+---+---------+
df.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: array (nullable = true)
# | |-- element: string (containsNull = true)
df.withColumn("C",expr('''transform(B,x -> split(x,"\\s+")[0])''')).show()
#using regexp_replace function
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\s+.*)",""))''')).show()
df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\s+\\((?i)A.+\\))",""))''')).show()
#+---+---------+----+
#| A| B| C|
#+---+---------+----+
#| x1| [s1]|[s1]|
#| x2|[s2 (A2)]|[s2]|
#+---+---------+----+
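The string-level effect of these patterns can be checked without a Spark session using Python's re module (Spark's regexp_replace uses Java regex, but these simple patterns behave the same in both); a minimal sketch on one hypothetical element:

```python
import re

# Mirror of the per-element transform logic on a single sample string.
element = "s2 (A2)"

# split on whitespace and keep the first token
first_token = re.split(r"\s+", element)[0]
print(first_token)  # s2

# strip everything from the first whitespace run onwards
stripped = re.sub(r"\s+.*", "", element)
print(stripped)  # s2

# stricter: only remove a " (A...)"-style suffix, case-insensitively
strict = re.sub(r"\s+\((?i:A.+)\)", "", element)
print(strict)  # s2
```

An element without a parenthesised suffix, such as "s1", passes through all three patterns unchanged.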
You can create a UDF that removes from the array every element matching the regex r"\(.*\)". If needed, you can change the regex to match r"\(A.\)" instead.
import re
from pyspark.sql import functions as F
from pyspark.sql import types as T

replaced = F.udf(lambda arr: [s for s in arr if not re.compile(r"\(.*\)").match(s)],
                 T.ArrayType(T.StringType()))
df.withColumn("value", replaced("B")).show()
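One caveat worth checking in plain Python (with hypothetical sample elements): re.match anchors at the start of the string, so r"\(.*\)" only drops elements that begin with "(". An element like "s2 (A2)" survives the filter whole. If the goal is the [s2] output shown above, a per-element re.sub may be closer to what is wanted:

```python
import re

arr = ["s1", "s2 (A2)", "(A3)"]  # hypothetical sample elements

pattern = re.compile(r"\(.*\)")

# The UDF's filter: match() anchors at position 0, so only elements that
# start with "(" are dropped; "s2 (A2)" is kept unchanged.
filtered = [s for s in arr if not pattern.match(s)]
print(filtered)  # ['s1', 's2 (A2)']

# To trim the "(A2)" suffix instead, substitute within each element.
trimmed = [re.sub(r"\s*\(.*\)", "", s) for s in arr]
print(trimmed)  # ['s1', 's2', '']
```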