Looking to get counts of items within ArrayType column without using Explode

Question

Note: I'm using Spark 2.4.

Here is my dataset:

df

col
[1,3,1,4]
[1,1,1,2]

I essentially want to get the value_counts of the values in each array. The resulting df would be:

df_upd

col
[{1:2},{3:1},{4:1}]
[{1:3},{2:1}]

I know I can do this by exploding the df and then grouping, but I'd like to know whether I can do it without exploding.

Answer 1

Here is a solution using a udf that outputs the result as a MapType. It expects integer values in the array (easily changed) and returns integer counts.

from pyspark.sql import functions as F
from pyspark.sql import types as T

# `sc` is the SparkContext that the pyspark shell provides
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])

df.show()

+---------------+
|         arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
|      [2, 2, 2]|
|         [3, 3]|
+---------------+

from collections import Counter

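# Map each distinct array element to the number of times it occurs
@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    # A null array yields a null map instead of raising inside the udf
    if array is None:
        return None
    return dict(Counter(array))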

df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)

+---------------+------------------------+
|arrays         |counts                  |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2]      |[2 -> 3]                |
|[3, 3]         |[3 -> 2]                |
+---------------+------------------------+
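
If a Python udf is also undesirable, the higher-order SQL functions introduced in Spark 2.4 may give the same kind of result with built-ins only. The sketch below just builds the SQL expression string. `value_counts_expr` is a hypothetical helper name, not part of PySpark, and the expression assumes the arrays contain no null elements (map keys cannot be null). It has not been run against the data above, so treat it as a starting point rather than a verified solution.

```python
def value_counts_expr(col_name):
    """Build a Spark SQL expression mapping each distinct array element to its count.

    Hypothetical helper (not a PySpark API). It relies only on functions that
    exist in Spark 2.4: array_distinct, transform, filter, size, map_from_entries.
    Assumes col_name is a plain column identifier and the arrays hold no nulls.
    """
    # For a given distinct element x, count the elements y of the array equal to x.
    count_of_x = "size(filter({c}, y -> y = x))".format(c=col_name)
    # Pair every distinct x with its count, then turn the array of pairs into a map.
    return "map_from_entries(transform(array_distinct({c}), x -> struct(x, {n})))".format(
        c=col_name, n=count_of_x
    )


expr = value_counts_expr("arrays")
print(expr)

# Possible usage with the df from the answer above (needs a running Spark session):
#   from pyspark.sql import functions as F
#   df.withColumn("counts", F.expr(value_counts_expr("arrays"))).show(truncate=False)
```

The inner `filter` rescans the array once per distinct value, which is quadratic in the array length; that is usually fine for short arrays, while the udf with `Counter` stays linear per row.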

pyspark

apache-spark-sql