Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to get the value of one key provided the value of another key meets a condition.
Example:
Suppose my dataset has a column 'statistics' where each row looks like this:
array
0: {"hair": "black", "eye": "white", "metric": "feet"}
1: {"hair": "blue", "eye": "white", "metric": "m"}
2: {"hair": "red", "eye": "brown", "metric": "feet"}
3: {"hair": "yellow", "eye": "white", "metric": "cm"}
I want to get the value of 'eye' where hair is 'black'.
I tried:
select
statistics.eye("*").filter(statistics.hair, x -> x == 'black')
from arrayData
but it throws an error and I cannot get the eye value. Please help.
You can convert it to a dataframe and query it. You can also register it as a temp table and query it with SQL:
from pyspark.sql import functions as F
df = sc.parallelize([
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]).toDF()
>>> df.show()
+-----+------+------+
| eye| hair|metric|
+-----+------+------+
|white| black| feet|
|white| blue| m|
|brown| red| feet|
|white|yellow| cm|
+-----+------+------+
>>> df.filter(F.col("hair") == 'black').show()
+-----+-----+------+
| eye| hair|metric|
+-----+-----+------+
|white|black| feet|
+-----+-----+------+
df.createOrReplaceTempView("data")
spark.sql("select * from data where hair ='black'")
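To make the filter-then-project logic concrete, here is a hedged pure-Python sketch of what the DataFrame filter above does, using plain dicts as stand-in sample data (the `rows` list is an assumption mirroring the table shown):

```python
# Hypothetical sample data mirroring the DataFrame above.
rows = [
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]

# Keep only rows where hair == 'black', then project the eye value,
# analogous to df.filter(F.col("hair") == 'black').select("eye").
eyes = [r["eye"] for r in rows if r["hair"] == "black"]
print(eyes)  # ['white']
```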
I eventually figured it out without having to convert to a dataframe first.
The aggregate command lets you get the value of one key when the value of another key meets a condition. For this case, the following command is enough:
select
aggregate(statistics,"",(agg,item)->concat(agg,CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END)) as EyeColor
from arrayData
For details on how to use this function, see here.
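To see what that `aggregate` call computes, here is a hedged pure-Python equivalent of the same left fold (the `statistics` list is assumed sample data): it starts from the empty-string accumulator and concatenates `item.eye` whenever `item.hair` is `'black'`.

```python
from functools import reduce

# Hypothetical array-of-structs column contents for one row.
statistics = [
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]

# aggregate(statistics, "", (agg, item) ->
#     concat(agg, CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END))
eye_color = reduce(
    lambda agg, item: agg + (item["eye"] if item["hair"] == "black" else ""),
    statistics,
    "",
)
print(eye_color)  # white
```

Because the fold concatenates, multiple matching elements would run their eye values together into one string; the DataFrame filter approach above keeps them as separate rows instead.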