Filter an array in a PySpark DataFrame
Spark version: 2.3.0
I have a PySpark DataFrame that contains an array column, and I want to filter the array elements by applying a string-matching condition. For example, given a DataFrame like this:
Array Col
['apple', 'banana', 'orange']
['strawberry', 'raspberry']
['apple', 'pineapple', 'grapes']
I want to keep, in each array, only the elements that contain the string 'apple' or start with 'app'. How can I achieve this in PySpark?
For Spark 2.4+ you can use higher-order functions:
from pyspark.sql import functions as F

df.withColumn("Filtered_Col", F.expr("filter(Array_Col, x -> x rlike '(?i)^app')")).show(truncate=False)
+--------------------------+------------+
|Array_Col |Filtered_Col|
+--------------------------+------------+
|[apple, banana, orange] |[apple] |
|[strawberry, raspberry] |[] |
|[apple, pineapple, grapes]|[apple] |
+--------------------------+------------+
For lower versions, you are probably better off with a UDF:
import re
from pyspark.sql import functions as F, types as T

def myf(v):
    l = []
    for i in v:
        # (?i) must lead the pattern: a mid-pattern inline flag is an error in Python 3.11+
        if re.match('(?i)^app', i):
            l.append(i)
    return l

myudf = F.udf(myf, T.ArrayType(T.StringType()))
df.withColumn("Filtered_Col", myudf("Array_Col")).show()
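Both snippets rely on the same case-insensitive regex, so it can be sanity-checked in plain Python (no Spark session needed) before wiring it into the DataFrame. A minimal sketch; `filter_app` is an illustrative name, not part of any API:

```python
import re

# case-insensitive "starts with app" -- the same pattern the rlike/UDF versions use
APP_PREFIX = re.compile(r'(?i)^app')

def filter_app(values):
    # keep only the elements matching the prefix pattern
    return [v for v in values if APP_PREFIX.match(v)]

print(filter_app(['apple', 'banana', 'orange']))     # ['apple']
print(filter_app(['strawberry', 'raspberry']))       # []
print(filter_app(['Apple', 'pineapple', 'grapes']))  # ['Apple']
```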
You can use filter in conjunction with exists, both of which are Higher Order Functions, to check whether any element in the array starts with the word.
Another approach is a UDF -
Data Preparation
sparkDF = sql.createDataFrame([(['apple', 'banana', 'orange'],),
                               (['strawberry', 'raspberry'],),
                               (['apple', 'pineapple', 'grapes'],)],
                              ['arr_column'])
sparkDF.show(truncate=False)
+--------------------------+
|arr_column |
+--------------------------+
|[apple, banana, orange] |
|[strawberry, raspberry] |
|[apple, pineapple, grapes]|
+--------------------------+
Filter & Exists - Spark >= 2.4 (note: the Python wrapper F.exists was only added in PySpark 3.1; on 2.4-3.0, use the SQL form, e.g. F.expr("exists(arr_column, x -> x like 'app%')"))
starts_with_app = lambda s: s.startswith("app")
sparkDF_filtered = sparkDF.filter(F.exists(F.col("arr_column"), starts_with_app))
sparkDF_filtered.show(truncate=False)
+--------------------------+
|arr_column |
+--------------------------+
|[apple, banana, orange] |
|[apple, pineapple, grapes]|
+--------------------------+
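The exists semantics above amount to "does any element satisfy the predicate", which is exactly Python's built-in any. The sketch below mirrors that logic on plain lists, which is also what you would do inside a UDF on Spark 2.3 (exists_app is an illustrative name, not a library function):

```python
def exists_app(arr):
    # plain-Python analogue of F.exists(col, lambda s: s.startswith("app"))
    return any(s.startswith("app") for s in arr)

rows = [['apple', 'banana', 'orange'],
        ['strawberry', 'raspberry'],
        ['apple', 'pineapple', 'grapes']]

# keep only rows whose array contains an "app"-prefixed element
kept = [r for r in rows if exists_app(r)]
print(kept)
```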
UDF - Lower Versions
from pyspark.sql.types import ArrayType, StringType

def filter_string(inp):
    res = []
    for s in inp:
        if s.startswith("app"):
            res += [s]
    if res:
        return res
    else:
        return None

filter_string_udf = F.udf(filter_string, ArrayType(StringType()))
sparkDF_filtered = sparkDF.withColumn('arr_filtered',filter_string_udf(F.col('arr_column')))\
.filter(F.col('arr_filtered').isNotNull())
sparkDF_filtered.show(truncate=False)
+--------------------------+------------+
|arr_column |arr_filtered|
+--------------------------+------------+
|[apple, banana, orange] |[apple] |
|[apple, pineapple, grapes]|[apple] |
+--------------------------+------------+
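Note that filter_string returns None rather than an empty list, so that non-matching rows can be dropped with isNotNull. That contract is easy to verify locally before registering the function as a UDF; a quick sketch using the same logic:

```python
def filter_string(inp):
    # keep "app"-prefixed elements; None (not []) flags a row to drop via isNotNull
    res = [s for s in inp if s.startswith("app")]
    return res if res else None

print(filter_string(['apple', 'banana', 'orange']))     # ['apple']
print(filter_string(['strawberry', 'raspberry']))       # None
print(filter_string(['apple', 'pineapple', 'grapes']))  # ['apple']
```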