Regex works in Presto SQL but not in PySpark expr

Data:

+---+-------------------------------------------------------------+
|id |filters                                                      |
+---+-------------------------------------------------------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|
+---+-------------------------------------------------------------+

Code to generate the data:

columns = ["id","filters"]
data = [(1, '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}')]

rdd = sc.parallelize(data)
dfFromRDD2 = rdd.toDF(columns)

I wrote two regexes to extract the keys and values from this string:

  1. Keys: \.([a-z_]+)\":
  2. Values: :\[([^:]+)\]
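
To rule out the patterns themselves, you can check them against the sample string with Python's `re` module (a quick standalone check outside Spark):

```python
import re

s = '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}'

# Keys: capture the word after the last dot, just before '":'
keys = re.findall(r'\.([a-z_]+)":', s)
print(keys)  # ['one', 'type_two']

# Values: capture everything between ':[' and ']'
values = re.findall(r':\[([^:]+)\]', s)
print(values)  # ['"A","B","C","D"', '"1"']
```

Both patterns behave as intended here, which suggests the problem is in how the patterns reach Spark, not in the regexes themselves.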

Now, when I run the following code, my value regex does not return the expected result:

dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\.([a-z_]+)\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\[([^:]+)\]', 0)")).show(truncate = False)


+---+-------------------------------------------------------------+---------------+-------------+
|id |filters                                                      |filter_category|filter_inputs|
+---+-------------------------------------------------------------+---------------+-------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|[one, type_two]|[:[, :[]     |
+---+-------------------------------------------------------------+---------------+-------------+

Both regexes work fine in Presto SQL.

You need to escape the backslashes one more time: both the Python parser and Spark's SQL string-literal parser strip one level of escaping, so by the time the pattern reaches the regex engine the backslashes are gone. The simplest fix is a raw Python string with doubled backslashes, e.g.

dfFromRDD2.withColumn("filter_category", expr(r"regexp_extract_all(filters, '\\.([a-z_]+)\":', 1)"))\
.withColumn("filter_inputs", expr(r"regexp_extract_all(filters, ':\\[([^:]+)\\]', 0)")).show(truncate = False)

Let me know if the above works for you.