Alteryx regex_countmatches equivalent in PySpark?
I am migrating some Alteryx workflows to PySpark jobs, and in one of them I ran into the following filter condition.
length([acc_id]) = 9
AND
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],2),"[[:alpha:]]")=2)
OR
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],1),"[[:alpha:]]")=1 AND
REGEX_CountMatches(right(left([acc_id],2),1), '9')=1
)
Could someone help me rewrite this condition for a PySpark dataframe?
You can use size and split. You will also need to use '[a-zA-Z]' for the regex, since Spark does not support POSIX character classes like "[[:alpha:]]".
For example,
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
should be equivalent to (in Spark SQL):
size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
You can put the Spark SQL string directly into the filter clause of a Spark dataframe:
df2 = df.filter("size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0")
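For completeness, here is a sketch (my own, untested; it assumes Spark 2.3+, where the SQL right and left functions are available) translating the whole Alteryx filter with the same size/split trick. Note that SQL gives AND higher precedence than OR, matching the original expression:
filter_expr = """
    length(acc_id) = 9
    AND (size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
         AND size(split(left(acc_id, 2), '[a-zA-Z]')) - 1 = 2)
    OR (size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
        AND size(split(left(acc_id, 1), '[a-zA-Z]')) - 1 = 1
        AND size(split(right(left(acc_id, 2), 1), '9')) - 1 = 1)
"""
df2 = df.filter(filter_expr)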
You can use length with regexp_replace to get an equivalent of Alteryx's REGEX_CountMatches function:
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
becomes:
# replace all non-alphabetic characters with '' then get the length
F.length(F.regexp_replace(F.expr("right(acc_id, 7)"), '[^A-Za-z]', '')) == 0
The right and left functions are only available in SQL; you can use them through expr.
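If you prefer to stay in the DataFrame API, a possible alternative (my own untested sketch) is Column.substr, which accepts Column arguments, so the start position can depend on each row's length. This matches right(acc_id, 7) only for strings of at least 7 characters; for shorter strings the start position goes negative, which Spark interprets as counting from the end:
from pyspark.sql import functions as F

# equivalent of right(acc_id, 7) for strings of length >= 7
last7 = F.col("acc_id").substr(F.length("acc_id") - 6, F.lit(7))
# equivalent of left(acc_id, 2)
first2 = F.col("acc_id").substr(F.lit(1), F.lit(2))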
Full example:
from pyspark.sql import Column
from pyspark.sql import functions as F
df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])
def regex_count_matches(c: Column, regex: str) -> Column:
    """
    Helper equivalent to Alteryx's REGEX_CountMatches: remove every
    character matching `regex`, then count what is left. To count a
    character class, pass its negation (e.g. '[^A-Za-z]' counts
    alphabetic characters).
    """
    return F.length(F.regexp_replace(c, regex, ''))
df.filter(
(F.length("acc_id") == 9) &
(
(regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
& (regex_count_matches(F.expr("left(acc_id, 2)"), '[^A-Za-z]') == 2)
) | (
(regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
& (regex_count_matches(F.expr("left(acc_id, 1)"), '[^A-Za-z]') == 1)
& (regex_count_matches(F.expr("right(left(acc_id, 2), 1)"), '[^9]') == 1)
)
).show()
#+---------+
#| acc_id|
#+---------+
#|AB1234567|
#+---------+
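If you are on Spark 3.5 or later, there is a built-in regexp_count function that maps onto REGEX_CountMatches directly, so the helper (and the negated patterns) can be dropped. A sketch, untested:
from pyspark.sql import functions as F

# regexp_count (Spark 3.5+) counts matches directly, so the patterns
# are no longer negated
df.filter(
    (F.length("acc_id") == 9) &
    (
        (F.regexp_count(F.expr("right(acc_id, 7)"), F.lit("[A-Za-z]")) == 0)
        & (F.regexp_count(F.expr("left(acc_id, 2)"), F.lit("[A-Za-z]")) == 2)
    ) | (
        (F.regexp_count(F.expr("right(acc_id, 7)"), F.lit("[A-Za-z]")) == 0)
        & (F.regexp_count(F.expr("left(acc_id, 1)"), F.lit("[A-Za-z]")) == 1)
        & (F.regexp_count(F.expr("right(left(acc_id, 2), 1)"), F.lit("9")) == 1)
    )
).show()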