Python RegEx with a list as search variables
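I have a DataFrame with a column email_adress_raw. Each row contains several email addresses, and I want to create a new column holding the first email address whose ending appears in a long list of specific email endings.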
email_endings = ['email_end1.com','email_end2.com','email_end3.com',...]
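I wrote the following function, and it already works, but because the list is long and still growing, I would like to iterate over the list in the code, or do something similar. I thought about a loop, but somehow I couldn't get it to work...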
import re
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def email_address_new(s):
    try:
        r = re.search(r"([\w.-]+@"+email_endings[0]+r"|[\w.-]+@"+email_endings[1]+r"|[\w.-]+@"+email_endings[2]+")", s).group()
    except AttributeError:
        # no match: re.search returned None
        print(s)
        return None
    except TypeError:
        # s is not a string (for example a null value)
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))
You can use join to combine the email endings in the list into a single regex pattern:
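import re
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

email_endings = ['email_end1.com','email_end2.com','email_end3.com']

def email_address_new(s):
    try:
        # one alternative per ending: "[\w.-]+@end1|[\w.-]+@end2|..."
        pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"
        r = re.search(pattern, s).group()
    except AttributeError:
        # no match: re.search returned None
        print(s)
        return None
    except TypeError:
        # s is not a string (for example a null value)
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df2 = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))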
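One caveat with the joined pattern is that the dot in each ending is a regex metacharacter, so `email_end1.com` would also match `email_end1Xcom`. Below is a minimal sketch of a stricter variant in plain Python, assuming the endings should be matched literally. The helper names `build_email_pattern` and `first_matching_email` are made up for illustration and are not part of the original answer.

```python
import re

# Sample endings from the question; the real list is longer.
email_endings = ["email_end1.com", "email_end2.com", "email_end3.com"]


def build_email_pattern(endings):
    """Return a regex matching an address that ends in one of `endings`."""
    # re.escape turns the "." in ".com" into a literal dot.
    escaped = "|".join(re.escape(e) for e in endings)
    # Write the local part once, then a non-capturing group of all endings.
    return r"[\w.-]+@(?:" + escaped + ")"


# Compile once instead of rebuilding the pattern for every row.
PATTERN = re.compile(build_email_pattern(email_endings))


def first_matching_email(s):
    """Return the leftmost address with a listed ending, or None."""
    if s is None:
        return None
    m = PATTERN.search(s)
    return m.group() if m else None
```

Note that `re.search` returns the leftmost match in the string, so the result is the first listed address in the row, regardless of the order of the endings in the list.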
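But you probably don't need a UDF for this. You can simply use regexp_extract and replace the empty string with null when nothing matches (regexp_extract returns an empty string if there is no match):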
import pyspark.sql.functions as F

email_endings = ['email_end1.com','email_end2.com','email_end3.com']
pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"

df2 = df.withColumn(
    "email",
    # when() without otherwise() gives null for rows where the condition is false
    F.when(
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1) != "",
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1)
    )
)
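The same idea can be wrapped in a small helper so that the extraction expression is written only once. This is just a sketch: the function names `build_pattern` and `with_first_email` and their parameters are hypothetical, and the snippet assumes `df` is a PySpark DataFrame with the column from the question.

```python
def build_pattern(endings):
    # Same join-based pattern as above: one alternative per ending,
    # all wrapped in capture group 1.
    return r"([\w.-]+@" + r"|[\w.-]+@".join(endings) + ")"


def with_first_email(df, endings, source_col="email_adress_raw", target_col="email"):
    # pyspark is imported here so build_pattern stays usable without Spark.
    import pyspark.sql.functions as F

    # Build the extraction expression once; reuse it as condition and value.
    extracted = F.regexp_extract(F.col(source_col), build_pattern(endings), 1)
    # when() without otherwise() yields null where nothing matched.
    return df.withColumn(target_col, F.when(extracted != "", extracted))
```

It would be called as `df2 = with_first_email(df, email_endings)`.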