Using \P{C} in Spark SQL regexp_replace

Question

I understand that \P{C} stands for "invisible control characters and unused code points" (https://www.regular-expressions.info/unicode.html).

When I run this in a Databricks notebook, it works fine:

%sql
SELECT regexp_replace('abcd', '\P{C}', 'x')

But the following fails (in both %python and %scala):

%python 
s = "SELECT regexp_replace('abcd', '\P{C}', 'x')"
display(spark.sql(s))

java.util.regex.PatternSyntaxException: Illegal repetition near index 0
P{C}
^

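The SQL command also works fine in Hive. I also tried escaping the curly braces, as suggested, but that did not help.

Is there something I am missing? Thanks.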

Answer 1

In the Spark SQL API, write 4 backslashes (\\\\) to end up with 1 backslash (\) in the regex:

spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+

spark.sql("SELECT string('\\\\')").show()
//+-----------------+
//|CAST(\ AS STRING)|
//+-----------------+
//|                \|
//+-----------------+
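
To see why four backslashes are needed, it can help to trace the string through each layer that consumes a backslash: the host-language string literal, then Spark's SQL string-literal parser, then the regex engine. The following is a minimal pure-Python sketch. `sql_unescape` is a hypothetical helper that only approximates the default unescaping of SQL string literals (it ignores special escapes such as newline, octal, and Unicode sequences); it is not Spark's actual implementation.

```python
def sql_unescape(body: str) -> str:
    """Approximate default SQL string-literal unescaping.

    Simplified assumption: a backslash is dropped and the character
    after it is kept; special escape sequences are not handled.
    """
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(body[i + 1])  # keep the escaped character, drop the backslash
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# What the SQL parser receives after the host language has processed the literal.
two_in_source = "\\P{C}"      # 2 backslashes in source -> the string holds 1
four_in_source = "\\\\P{C}"   # 4 backslashes in source -> the string holds 2

regex_from_two = sql_unescape(two_in_source)    # the only backslash is consumed here
regex_from_four = sql_unescape(four_in_source)  # one backslash survives for the regex engine

print(repr(regex_from_two))
print(repr(regex_from_four))
```

Under these assumptions, when only one backslash reaches the SQL parser the regex engine sees P{C}, which is the pattern reported in the PatternSyntaxException from the question.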

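(or)

Enable the escapedStringLiterals property to fall back to Spark 1.6 string-literal behavior, so that the SQL parser leaves backslashes untouched:

spark.sql("set spark.sql.parser.escapedStringLiterals=true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()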
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+
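
In Python, a raw string removes the host-language layer of escaping, so the remaining question is only whether the SQL parser will unescape the literal. The sketch below builds the SQL text for both settings. `build_query` is a hypothetical helper for illustration; it assumes the pattern contains no single quotes and performs no other SQL escaping.

```python
def build_query(pattern: str, escaped_string_literals: bool) -> str:
    """Build the SQL text for regexp_replace('abcd', <pattern>, 'x').

    Hypothetical helper: assumes the pattern contains no single quotes.
    """
    if not escaped_string_literals:
        # Default parser behaviour: each backslash must be doubled in the SQL text.
        pattern = pattern.replace("\\", "\\\\")
    return "SELECT regexp_replace('abcd', '" + pattern + "', 'x')"


regex = r"\P{C}"  # raw string: exactly one backslash reaches the helper

default_sql = build_query(regex, escaped_string_literals=False)
legacy_sql = build_query(regex, escaped_string_literals=True)

print(default_sql)  # SQL text with two backslashes before P
print(legacy_sql)   # SQL text with one backslash before P
```

Either string would then be passed to spark.sql(...) in a session whose spark.sql.parser.escapedStringLiterals setting matches the flag used to build it.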

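In the DataFrame API, write 2 backslashes (\\) to end up with 1 backslash (\):

import org.apache.spark.sql.functions.{lit, regexp_replace}

// df is assumed to be an existing DataFrame with a single column "value"
df.withColumn("dd",regexp_replace(lit("abcd"), "\\P{C}", "x")).show()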
//+-----+----+
//|value|  dd|
//+-----+----+
//|    1|xxxx|
//+-----+----+

df.withColumn("dd",lit("\\")).show()
//+-----+---+
//|value| dd|
//+-----+---+
//|    1|  \|
//+-----+---+
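
As background on the pattern itself: in Java-style regular expressions, \p{C} matches characters in the Unicode "Other" group (control, format, surrogate, private-use, and unassigned code points), and the uppercase \P{C} is its negation. That is why every letter of 'abcd' is replaced in the outputs above. The sketch below approximates this behavior in plain Python with unicodedata. `replace_non_other` is a hypothetical name, and category assignments may differ slightly between the Unicode versions bundled with Python and with the JVM.

```python
import unicodedata


def replace_non_other(text: str, repl: str) -> str:
    """Replace every character whose general category is NOT in group C."""
    return "".join(
        ch if unicodedata.category(ch).startswith("C") else repl
        for ch in text
    )


# Ordinary letters are outside group C, so all of them are replaced.
print(replace_non_other("abcd", "x"))            # xxxx

# U+0000 is a control character (category Cc), so it is left alone.
print(repr(replace_non_other("a\u0000b", "x")))  # 'x\x00x'
```

To strip control characters instead of visible ones, the lowercase \p{C} would be the pattern to use.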

