How do I replace a special character in a PySpark DataFrame?
I have a column containing the value *NZ, and I want to remove the *.
df.groupBy('State1').count().show()
+-----------+-----+
| State1|count|
+-----------+-----+
| NT| 1423|
| ACT| 2868|
| SA|12242|
| TAS| 4603|
| WA|35848|
| *NZ| 806|
| QLD|44410|
| missing| 2612|
| VIC|40607|
| NSW|45195|
+-----------+-----+
I have tried both of these:
df = df.select("State1", f.translate(f.col("State1"), "*", ""))
df = df.withColumn('State1', regexp_replace('State1', '*',''))
The first snippet doesn't do anything.
The second one runs, but it throws an error when I display the result:
df.groupBy('State1').count().show()
Py4JJavaError Traceback (most recent call last)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 725.0 failed 1 times, most recent failure: Lost task 1.0 in stage 725.0 (TID 13480, localhost, executor driver): java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
How can I replace the *?
You can do this with regexp_replace on "\*":
from pyspark.sql import functions as F
df.withColumn("State1", F.regexp_replace("State1","\*","")).show()
+-------+-----+
| State1|count|
+-------+-----+
| NT| 1423|
| ACT| 2868|
| SA|12242|
| TAS| 4603|
| WA|35848|
| NZ| 806|
| QLD|44410|
|missing| 2612|
| VIC|40607|
| NSW|45195|
+-------+-----+
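A side note on the two attempts from the question: the translate call actually works, but select returns a new auto-named column instead of overwriting State1, which is why it looked like it did nothing, while regexp_replace failed because its pattern is a Java regex, where a bare * is a dangling quantifier. A minimal sketch of both fixes (assuming the same df and column name State1):

from pyspark.sql import functions as F

# Regex route: escape the * (a raw string avoids Python's invalid-escape warning)
df = df.withColumn("State1", F.regexp_replace("State1", r"\*", ""))

# Non-regex route: translate removes listed characters literally, so no escaping is needed
df = df.withColumn("State1", F.translate(F.col("State1"), "*", ""))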
As @mazaneicha noted in the comments, you can also use replace:
from pyspark.sql import functions as F
df.withColumn("State1", F.expr("""replace(state1,'*')""")).show()
This works:
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace("*", ""), StringType())
df = df.withColumn("State1", udf(col("State1")))
Because * carries a special meaning in a regex (it is treated as a metacharacter rather than a literal), escape it and proceed like this:
from pyspark.sql import functions as F

df2 = df.withColumn("CleanState", F.regexp_replace("State", "\*", ""))
df2.show()
+-------+-----+----------+
| State|count|CleanState|
+-------+-----+----------+
| NT| 1423| NT|
| A*CT| 2868| ACT|
| SA|12242| SA|
| TAS| 4603| TAS|
| WA|35848| WA|
| *NZ| 806| NZ|
| QLD*|44410| QLD|
|missing| 2612| missing|
| VIC|40607| VIC|
| NSW|45195| NSW|
+-------+-----+----------+
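To get back to the grouped counts from the question once the column is cleaned, overwrite the column on the original (ungrouped) DataFrame and re-run the aggregation; a minimal sketch assuming the question's column name State1:

from pyspark.sql import functions as F

# Clean the column in place, then re-aggregate
df = df.withColumn("State1", F.regexp_replace("State1", r"\*", ""))
df.groupBy("State1").count().show()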