Remove an element from a Python list of lists in PySpark DataFrame

Question

I am trying to remove an element from a Python list of lists:

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|
+---------------+

I would like to be able to remove an element, rm, from each of the lists in the list above. I wrote a function that does this for a list of lists:

def asdf(df, rm):
    temp = df
    for n in range(len(df)):
        temp[n] = [x for x in df[n] if x != rm]
    return(temp)

It does remove rm = 1:

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]

But I cannot get it to work for a DataFrame:

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist

Desired behavior:

In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: 

+---------------+
|         src_ex|
+---------------+
|           [62]|
|            [7]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|        [7, 62]|
+---------------+

(except with the new column above appended to the PySpark DataFrame, df)

Any suggestions or ideas?

Answer 1

Spark >= 2.4

You can use array_remove:

from pyspark.sql.functions import array_remove

df.withColumn("src_ex", array_remove("sources", 32)).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

Or filter:

from pyspark.sql.functions import expr

df.withColumn("src_ex", expr("filter(sources, x -> not(x <=> 32))")).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

Spark < 2.4

A few things:

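A DataFrame is not a list of lists. In fact it is not even a plain Python object: it has no len and it is not Iterable.
Your column looks like a plain array type.
You cannot reference a DataFrame (or any other distributed data structure) inside a UDF.
Every argument passed directly to a UDF call has to be a str (a column name) or a Column object. To pass a literal, use the lit function.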

The only thing that remains is a list comprehension:

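from pyspark.sql.functions import lit, udf
from pyspark.sql.types import ArrayType, IntegerType

def drop_from_array_(arr, item):
    # Plain Python: keep every element that is not equal to item.
    return [x for x in arr if x != item]

# Wrap the function as a UDF that returns an array of integers.
drop_from_array = udf(drop_from_array_, ArrayType(IntegerType()))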
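
Because the function body is ordinary Python, it can be checked without a Spark session. One thing worth checking is a null array: Spark hands a null cell to a Python UDF as None, and iterating over None raises a TypeError. Below is a minimal sketch of a null-tolerant variant; the name drop_from_array_safe is hypothetical, and returning None for a null input is an assumption about the desired behavior.

```python
def drop_from_array_safe(arr, item):
    """Return a new list without `item`; pass None (a null array) through unchanged."""
    if arr is None:
        # A null array cell arrives in a Python UDF as None.
        return None
    return [x for x in arr if x != item]


print(drop_from_array_safe([7, 32, 62], 32))
print(drop_from_array_safe(None, 32))
```

This variant can be wrapped with udf in the same way as the original function.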

Example usage:

df = sc.parallelize([
    [62], [7, 32], [62], [18, 36, 62], [7, 31, 36, 62], [7, 32, 62]
]).map(lambda x: (x, )).toDF(["sources"])

df.withColumn("src_ex", drop_from_array("sources", lit(32))).show()

Result:

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+
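
If the value to remove is fixed when the UDF is defined, another option is to bind it in a closure instead of passing it with lit. The sketch below shows only the plain-Python part; make_dropper and drop_32 are hypothetical names introduced for illustration.

```python
def make_dropper(item):
    """Build a one-argument function that removes `item` from a list."""
    def drop(arr):
        # `item` is captured from the enclosing call, so only the array is passed in.
        return [x for x in arr if x != item]
    return drop


drop_32 = make_dropper(32)
print(drop_32([7, 32, 62]))
```

Wrapped as udf(make_dropper(32), ArrayType(IntegerType())), the resulting UDF would be called with the column only, for example with "sources" as its single argument, so no lit is needed. The trade-off is that a separate UDF has to be built for each value to remove.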
