Create a list of values from an array of maps in PySpark
I have a table like this:
company_id | an_array_of_maps
--------------------------------------------------------------
234 | [{"a": "a2", "b": "b2"}, {"a": "a4", "b": "b2"}]
123 | [{"a": "a1", "b": "b1"}, {"a": "a1", "b": "b1"}]
678 | [{"b": "b5", "c": "c5"}, {"b": Null, "c": "c5"}]
I'd like to get a table like this (the value of the "a" key from each map):
company_id | an_array_of_maps
--------------------------------------------------------------
234 | ["a2", "a4"]
123 | ["a1", "a1"]
678 | ["b5", Null]
I tried:
df.withColumn("array_of_as", F.expr("filter(an_array_of_maps, x -> x.a)")).show()
but I got the following error:
AnalysisException: cannot resolve 'filter(`an_array_of_maps`, lambdafunction(namedlambdavariable()['a'], namedlambdavariable()))' due to data type mismatch: argument 2 requires boolean type, however, 'lambdafunction(namedlambdavariable()['a'], namedlambdavariable())' is of string type.;
Figured it out: `filter` is the wrong function here. It should be:
(df
    .withColumn("array_of_as",
                F.expr("transform(an_array_of_maps, x -> x.a)"))
).show()
I'm not filtering anything; I'm transforming the list of maps into a list of the maps' values, hence `transform`.
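Spark's `transform` applies the lambda to every element and keeps the array's length, returning `NULL` wherever the key is absent. A rough plain-Python sketch of that semantics (the dict rows are stand-ins for the map column; this is an illustration, not Spark itself):

```python
# Plain-Python emulation of transform(an_array_of_maps, x -> x.a):
# apply the lambda to every element, preserving the array's length.
# A missing key becomes None, mirroring Spark's NULL for an absent map key.
rows = {
    234: [{"a": "a2", "b": "b2"}, {"a": "a4", "b": "b2"}],
    123: [{"a": "a1", "b": "b1"}, {"a": "a1", "b": "b1"}],
    678: [{"b": "b5", "c": "c5"}, {"b": None, "c": "c5"}],
}

def transform_a(array_of_maps):
    """Emulate transform(..., x -> x.a): one output element per input map."""
    return [m.get("a") for m in array_of_maps]

result = {company_id: transform_a(maps) for company_id, maps in rows.items()}
# result[234] -> ["a2", "a4"]; result[123] -> ["a1", "a1"]
```

Note that row `678` has no `"a"` key at all, so `x -> x.a` yields `[None, None]` for it rather than the `["b5", Null]` shown in the desired output above; getting `b` values for that row would require `x.b` instead.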