How can I extract all the instances of a regular expression pattern in a PySpark dataframe?
I have a StringType() column in a PySpark dataframe. I want to extract all instances of a regex pattern from that string and put them into a new column of ArrayType(StringType()).

Suppose the regex pattern is [a-z]*([0-9]*).
Input df:
+-----------+
|stringValue|
+-----------+
| a1234bc123|
| av1tb12h18|
|       abcd|
+-----------+
Output df:
+-----------+-----------------+
|stringValue|           output|
+-----------+-----------------+
| a1234bc123|  ['1234', '123']|
| av1tb12h18|['1', '12', '18']|
|       abcd|               []|
+-----------+-----------------+
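For reference, the per-row behavior I'm after is what Python's re.findall gives on a single string (a plain-Python sketch of the expected semantics, not a Spark solution; I use the simpler equivalent pattern [0-9]+ here to avoid the empty matches the capture group would yield):
import re
# Each row of 'output' should hold what findall returns for that row's string.
print(re.findall(r'[0-9]+', 'a1234bc123'))  # ['1234', '123']
print(re.findall(r'[0-9]+', 'abcd'))        # []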
Try using split and array_remove from pyspark.sql.functions:
- Create a test DataFrame
from pyspark.sql import functions as F
df = spark.createDataFrame([("a1234bc123",), ("av1tb12h18",), ("abcd",)],["stringValue"])
df.show()
Original DataFrame:
+-----------+
|stringValue|
+-----------+
| a1234bc123|
| av1tb12h18|
| abcd|
+-----------+
- Use split to break the string apart so that only the digit runs remain
df = df.withColumn("mid", F.split('stringValue', r'[a-zA-Z]'))
df.show()
Output:
+-----------+-----------------+
|stringValue| mid|
+-----------+-----------------+
| a1234bc123| [, 1234, , 123]|
| av1tb12h18|[, , 1, , 12, 18]|
| abcd| [, , , , ]|
+-----------+-----------------+
- Finally, use array_remove to drop the empty elements that split leaves behind wherever letters were adjacent or sat at the string boundaries
df = df.withColumn("output", F.array_remove('mid', ''))
df.show()
Final output:
+-----------+-----------------+-----------+
|stringValue| mid| output|
+-----------+-----------------+-----------+
| a1234bc123| [, 1234, , 123]|[1234, 123]|
| av1tb12h18|[, , 1, , 12, 18]|[1, 12, 18]|
| abcd| [, , , , ]| []|
+-----------+-----------------+-----------+
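The two steps can also be chained into a single expression. A minimal sketch of the same split/array_remove approach; splitting on runs of letters with + just leaves fewer empty tokens for array_remove to drop:
from pyspark.sql import functions as F  # same import as above
# One-step equivalent: split on runs of letters, then drop the empty strings.
df = df.withColumn(
    "output",
    F.array_remove(F.split("stringValue", r"[a-zA-Z]+"), "")
)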
You can use a combination of the regexp_replace and split APIs from the functions module:
import pyspark.sql.types as t
import pyspark.sql.functions as f
l1 = [('anystring',),('a1234bc123',),('av1tb12h18',)]
df = spark.createDataFrame(l1).toDF('col')
df.show()
+----------+
| col|
+----------+
| anystring|
|a1234bc123|
|av1tb12h18|
+----------+
Now replace each regex match with "$1," (the captured group followed by a "," delimiter) and then split on ",". Here $1 refers to the first capture group, so it is empty whenever the match contains no digits.
e.g. for 'anystring':
$0 = anystring
$1 = ""
dfl1 = df.withColumn('temp', f.split(f.regexp_replace("col", "[a-z]*([0-9]*)", "$1,"), ","))
dfl1.show()
+----------+---------------+
| col| temp|
+----------+---------------+
| anystring| [, , ]|
|a1234bc123|[1234, 123, , ]|
|av1tb12h18|[1, 12, 18, , ]|
+----------+---------------+
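To see where the trailing empty entries in temp come from, it can help to inspect the replaced string before splitting (a quick sanity check, using the f and df defined above; the final empty match at the end of each string contributes one extra ","):
# 'a1234bc123' -> '1234,123,,' ; 'anystring' -> ',,'
df.withColumn('replaced', f.regexp_replace('col', '[a-z]*([0-9]*)', '$1,')).show()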
Spark <2.4
Use a UDF to drop the empty values from the array:
def func_drop_from_array(arr):
    return [x for x in arr if x != '']
drop_from_array = f.udf(func_drop_from_array, t.ArrayType(t.StringType()))
dfl1.withColumn('final', drop_from_array('temp')).show()
+----------+---------------+-----------+
| col| temp| final|
+----------+---------------+-----------+
| anystring| [, , ]| []|
|a1234bc123|[1234, 123, , ]|[1234, 123]|
|av1tb12h18|[1, 12, 18, , ]|[1, 12, 18]|
+----------+---------------+-----------+
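Equivalently, f.udf can be applied as a decorator; this is only a stylistic alternative to the wrapping above, the behavior is identical (t, f and dfl1 as defined earlier):
@f.udf(t.ArrayType(t.StringType()))
def drop_from_array(arr):
    # Keep only the non-empty strings.
    return [x for x in arr if x != '']

dfl1.withColumn('final', drop_from_array('temp')).show()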
Spark >=2.4
dfl1.withColumn('final', f.array_remove('temp','')).show()
+----------+---------------+-----------+
| col| temp| final|
+----------+---------------+-----------+
| anystring| [, , ]| []|
|a1234bc123|[1234, 123, , ]|[1234, 123]|
|av1tb12h18|[1, 12, 18, , ]|[1, 12, 18]|
+----------+---------------+-----------+
In Spark 3.1+, regexp_extract_all is available:
regexp_extract_all(str, regexp[, idx]) - Extracts all strings in str that match the regexp expression and correspond to the regex group index.
df = df.withColumn('output', F.expr("regexp_extract_all(stringValue, '[a-z]*([0-9]+)', 1)"))
df.show()
#+-----------+-----------+
#|stringValue| output|
#+-----------+-----------+
#| a1234bc123|[1234, 123]|
#| av1tb12h18|[1, 12, 18]|
#| abcd| []|
#+-----------+-----------+
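The SQL function has to go through expr above, but newer PySpark releases also ship a native wrapper (to my knowledge, F.regexp_extract_all was added to pyspark.sql.functions in 3.5; note the pattern must be passed as a column there):
# Same result without expr(); requires a recent PySpark.
df = df.withColumn(
    'output',
    F.regexp_extract_all('stringValue', F.lit(r'[a-z]*([0-9]+)'), 1)
)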