Regex works in Presto SQL but not in PySpark expr

Data:

+---+-------------------------------------------------------------+
|id |filters                                                      |
+---+-------------------------------------------------------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|
+---+-------------------------------------------------------------+

Code to generate the data:

columns = ["id","filters"]
data = [(1, '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}')]

rdd = sc.parallelize(data)
dfFromRDD2 = rdd.toDF(columns)

I wrote two regexes to extract the keys and values from this string:

  1. Keys: \.([a-z_]+)\":
  2. Values: :\[([^:]+)\]
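
To rule out the patterns themselves, you can check them against the sample string with Python's `re` module (a quick standalone check outside Spark):

```python
import re

s = '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}'

# Keys: capture the word after the last dot, just before '":'
keys = re.findall(r'\.([a-z_]+)":', s)
print(keys)  # ['one', 'type_two']

# Values: capture everything between ':[' and ']'
values = re.findall(r':\[([^:]+)\]', s)
print(values)  # ['"A","B","C","D"', '"1"']
```

Both patterns behave as intended here, which suggests the problem is in how the patterns reach Spark, not in the regexes themselves.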

Now, when I run the following code, my value regex does not return the expected result:

dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\.([a-z_]+)\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\[([^:]+)\]', 0)")).show(truncate = False)


+---+-------------------------------------------------------------+---------------+-------------+
|id |filters                                                      |filter_category|filter_inputs|
+---+-------------------------------------------------------------+---------------+-------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|[one, type_two]|[:[, :[]     |
+---+-------------------------------------------------------------+---------------+-------------+

Both regexes work fine in Presto SQL.

You need to escape the backslashes one more time: both the Python parser and Spark's SQL string-literal parser strip one level of escaping, so by the time the pattern reaches the regex engine the backslashes are gone. The simplest fix is a raw Python string with doubled backslashes, e.g.

dfFromRDD2.withColumn("filter_category", expr(r"regexp_extract_all(filters, '\\.([a-z_]+)\":', 1)"))\
.withColumn("filter_inputs", expr(r"regexp_extract_all(filters, ':\\[([^:]+)\\]', 0)")).show(truncate = False)

Let me know if the above works for you.