How do I replace a special character in a PySpark DataFrame?
I have a column containing the value *NZ, and I want to remove the *.
df.groupBy('State1').count().show()
+-----------+-----+
| State1|count|
+-----------+-----+
| NT| 1423|
| ACT| 2868|
| SA|12242|
| TAS| 4603|
| WA|35848|
| *NZ| 806|
| QLD|44410|
| missing| 2612|
| VIC|40607|
| NSW|45195|
+-----------+-----+
I have tried both of these:
df = df.select("State1", f.translate(f.col("State1"), "*", ""))
df = df.withColumn('State1', regexp_replace('State1', '*',''))
The first snippet doesn't do anything.
The second one runs, but it throws an error when I display the result:
df.groupBy('State1').count().show()
Py4JJavaError Traceback (most recent call last)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 725.0 failed 1 times, most recent failure: Lost task 1.0 in stage 725.0 (TID 13480, localhost, executor driver): java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
How can I replace the *?
You can do this with regexp_replace on "\*":
from pyspark.sql import functions as F
df.withColumn("State1", F.regexp_replace("State1","\*","")).show()
+-------+-----+
| State1|count|
+-------+-----+
| NT| 1423|
| ACT| 2868|
| SA|12242|
| TAS| 4603|
| WA|35848|
| NZ| 806|
| QLD|44410|
|missing| 2612|
| VIC|40607|
| NSW|45195|
+-------+-----+
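A side note on the two attempts from the question: the translate call actually works, but select returns a new auto-named column instead of overwriting State1, which is why it looked like it did nothing, while regexp_replace failed because its pattern is a Java regex, where a bare * is a dangling quantifier. A minimal sketch of both fixes (assuming the same df and column name State1):

from pyspark.sql import functions as F

# Regex route: escape the * (a raw string avoids Python's invalid-escape warning)
df = df.withColumn("State1", F.regexp_replace("State1", r"\*", ""))

# Non-regex route: translate removes listed characters literally, so no escaping is needed
df = df.withColumn("State1", F.translate(F.col("State1"), "*", ""))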
As @mazaneicha noted in the comments, you can also use replace:
from pyspark.sql import functions as F
df.withColumn("State1", F.expr("""replace(state1,'*')""")).show()
This works:
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace("*", ""), StringType())
df = df.withColumn("State1", udf(col("State1")))
Because * carries a special meaning in a regex (it is treated as a metacharacter rather than a literal), escape it and proceed like this:
from pyspark.sql import functions as F

df2 = df.withColumn("CleanState", F.regexp_replace("State", "\*", ""))
df2.show()
+-------+-----+----------+
| State|count|CleanState|
+-------+-----+----------+
| NT| 1423| NT|
| A*CT| 2868| ACT|
| SA|12242| SA|
| TAS| 4603| TAS|
| WA|35848| WA|
| *NZ| 806| NZ|
| QLD*|44410| QLD|
|missing| 2612| missing|
| VIC|40607| VIC|
| NSW|45195| NSW|
+-------+-----+----------+
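To get back to the grouped counts from the question once the column is cleaned, overwrite the column on the original (ungrouped) DataFrame and re-run the aggregation; a minimal sketch assuming the question's column name State1:

from pyspark.sql import functions as F

# Clean the column in place, then re-aggregate
df = df.withColumn("State1", F.regexp_replace("State1", r"\*", ""))
df.groupBy("State1").count().show()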