Filter dictionary in pyspark with key names

Given a dictionary-like column in a dataset, I want to get the value of one key only when the value of another key meets a condition.

Example: suppose I have a column 'statistics' in my dataset, where each row of data looks like this:

array
0: {"hair": "black", "eye": "white", "metric": "feet"}
1: {"hair": "blue", "eye": "white", "metric": "m"}
2: {"hair": "red", "eye": "brown", "metric": "feet"}
3: {"hair": "yellow", "eye": "white", "metric": "cm"}

I want to get the value of 'eye' when 'hair' is 'black'.

I tried:

select
statistics.eye("*").filter(statistics.hair, x -> x == 'black')
from arrayData

But it throws an error and I can't get the eye value. Please assist.

You can convert it to a dataframe and read from that. You can also register it as a temp table and query it with SQL:

from pyspark.sql import functions as F

# build a dataframe from the list of dicts
df = sc.parallelize([
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]).toDF()

>>> df.show()
+-----+------+------+
|  eye|  hair|metric|
+-----+------+------+
|white| black|  feet|
|white|  blue|     m|
|brown|   red|  feet|
|white|yellow|    cm|
+-----+------+------+

>>> df.filter(F.col("hair") == 'black').show()
+-----+-----+------+
|  eye| hair|metric|
+-----+-----+------+
|white|black|  feet|
+-----+-----+------+

# or register it as a temp table and query with SQL
df.createOrReplaceTempView("data")
spark.sql("select * from data where hair = 'black'").show()
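If 'statistics' is actually an array-of-structs column as in the question, rather than one row per dict, a sketch of the same dataframe route would first explode the array and then apply the same filter (assuming the arrayData view from the question; the explode step and alias are my additions):

from pyspark.sql import functions as F

# Hypothetical: flatten the array column into one row per struct, then filter
exploded = spark.table("arrayData").select(F.explode("statistics").alias("s"))
exploded.filter(F.col("s.hair") == "black").select("s.eye").show()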

I finally figured it out without having to convert to a dataframe first.

The aggregate command lets you pull the value from one key when the value of another key meets the condition. For this case, the following is sufficient:

select 
aggregate(statistics,"",(agg,item)->concat(agg,CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END)) as EyeColor
from arrayData
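The same higher-order-function expression can also be run from the DataFrame API via expr. This is a sketch assuming Spark 2.4+ and the arrayData view with a statistics column; the names are illustrative:

from pyspark.sql import functions as F

# Hypothetical: same aggregate expression, evaluated through the DataFrame API
result = spark.table("arrayData").select(
    F.expr(
        "aggregate(statistics, '', "
        "(agg, item) -> concat(agg, CASE WHEN item.hair = 'black' THEN item.eye ELSE '' END))"
    ).alias("EyeColor")
)
result.show()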

See here for details on how to use this function.