Filter dictionary in pyspark with key names

Given a dictionary-like column in a dataset, I want to get the value of one key only when the value of another key meets a condition.

Example: suppose I have a column 'statistics' in my dataset, where each row of data looks like this:

array
0: {"hair": "black", "eye": "white", "metric": "feet"}
1: {"hair": "blue", "eye": "white", "metric": "m"}
2: {"hair": "red", "eye": "brown", "metric": "feet"}
3: {"hair": "yellow", "eye": "white", "metric": "cm"}

I want to get the value of 'eye' when 'hair' is 'black'.

I tried:

select
statistics.eye("*").filter(statistics.hair, x -> x == 'black')
from arrayData

But it throws an error and I can't get the eye value. Please assist.

You can convert it to a dataframe and read from that. You can also register it as a temp table and query it with SQL:

from pyspark.sql import functions as F

# build a dataframe from the list of dicts
df = sc.parallelize([
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]).toDF()

>>> df.show()
+-----+------+------+
|  eye|  hair|metric|
+-----+------+------+
|white| black|  feet|
|white|  blue|     m|
|brown|   red|  feet|
|white|yellow|    cm|
+-----+------+------+

>>> df.filter(F.col("hair") == 'black').show()
+-----+-----+------+
|  eye| hair|metric|
+-----+-----+------+
|white|black|  feet|
+-----+-----+------+

# or register it as a temp table and query with SQL
df.createOrReplaceTempView("data")
spark.sql("select * from data where hair = 'black'").show()
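If 'statistics' is actually an array-of-structs column as in the question, rather than one row per dict, a sketch of the same dataframe route would first explode the array and then apply the same filter (assuming the arrayData view from the question; the explode step and alias are my additions):

from pyspark.sql import functions as F

# Hypothetical: flatten the array column into one row per struct, then filter
exploded = spark.table("arrayData").select(F.explode("statistics").alias("s"))
exploded.filter(F.col("s.hair") == "black").select("s.eye").show()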

I finally figured it out without having to convert to a dataframe first.

The aggregate command lets you pull the value from one key when the value of another key meets the condition. For this case, the following is sufficient:

select 
aggregate(statistics,"",(agg,item)->concat(agg,CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END)) as EyeColor
from arrayData
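The same higher-order-function expression can also be run from the DataFrame API via expr. This is a sketch assuming Spark 2.4+ and the arrayData view with a statistics column; the names are illustrative:

from pyspark.sql import functions as F

# Hypothetical: same aggregate expression, evaluated through the DataFrame API
result = spark.table("arrayData").select(
    F.expr(
        "aggregate(statistics, '', "
        "(agg, item) -> concat(agg, CASE WHEN item.hair = 'black' THEN item.eye ELSE '' END))"
    ).alias("EyeColor")
)
result.show()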

See here for details on how to use this function.