AWS Glue Dynamic Frame Pushdown Predicate List

Question

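When a pushdown predicate is used with an AWS Glue dynamic frame, how does it iterate over lists of values?

For example, the following lists are built and used in the pushdown predicate: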

# Collect the partition values from each DataFrame into plain Python lists.
day = list(p_day.select('day').toPandas()['day'])
month = list(p_month.select('month').na.drop().toPandas()['month'])
year = list(p_year.select('year').toPandas()['year'])

# Quote each value and join the lists into three independent IN clauses.
predicate = "day in (%s) and month in (%s) and year in (%s)" % (
    ",".join(map(lambda s: "'" + str(s) + "'", day)),
    ",".join(map(lambda s: "'" + str(s) + "'", month)),
    ",".join(map(lambda s: "'" + str(s) + "'", year)),
)

Suppose this returns:

"day in ('07','15') and month in ('11','09','08') and year in ('2021')"

How does the pushdown predicate read this combination of lists?

Is it:

day	month	year
07	11	2021
15	11	2021
07	09	2021
15	09	2021
07	08	2021
15	08	2021

-or-

day	month	year
07	11	2021
15	11	2021
15	08	2021
15	09	2021

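I think the lists are read like the first table rather than the second... but the second is what I want to pass as the pushdown predicate. Does building separate lists inherently produce every permutation? It is as if the real day/month/year combinations get lost in the lists; they should be 11/7/2021, 11/15/2021, 08/15/2021, and 09/15/2021.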

Answer 1

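This has nothing to do with Glue itself, because a partition predicate is just basic Spark SQL. Each IN clause is evaluated independently and the ANDs keep every partition that satisfies all three, so you will get the first list, not the second. You have to restructure the boolean expression to get the second list.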
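One way to restructure the expression is to OR together one fully specified clause per date. The following is only a minimal sketch: the helper name build_partition_predicate and the hard-coded wanted list are hypothetical, and it assumes the real (day, month, year) combinations are available as tuples rather than as three separate lists.

```python
from itertools import product


def build_partition_predicate(combos):
    """Build an OR-of-ANDs predicate from (day, month, year) tuples.

    Hypothetical helper for illustration; it is not part of Glue or Spark.
    """
    clauses = [
        "(day = '%s' and month = '%s' and year = '%s')" % (d, m, y)
        for d, m, y in combos
    ]
    return " or ".join(clauses)


# The four real dates from the question, kept together as tuples (assumed input form).
wanted = [
    ("07", "11", "2021"),
    ("15", "11", "2021"),
    ("15", "08", "2021"),
    ("15", "09", "2021"),
]
predicate = build_partition_predicate(wanted)

# For contrast: the combinations accepted by three independent IN lists
# are the full cross product of the lists.
cross = list(product(["07", "15"], ["11", "09", "08"], ["2021"]))

print(predicate)
print(len(cross), "combinations match the IN-list form")
```

The resulting string can be passed as the pushdown predicate in place of the IN-list version. If the dates live in a single DataFrame, the tuples could presumably be collected from its distinct day, month, and year rows, but that depends on how p_day, p_month, and p_year were derived.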

pyspark

aws-glue

aws-glue-spark