Problems when writing Parquet with timestamps prior to 1900 in AWS Glue 3.0

Question

After switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900, with this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, 
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404.
You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. 
Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

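I tried everything I could to set the int96RebaseModeInRead config in Glue and even contacted Support, but it seems that Glue currently overwrites that flag and you cannot set it yourself.

If anyone knows a workaround, that would be great. Otherwise I will stay on Glue 2.0 and wait for the Glue dev team to fix this.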

Answer 1

I got it working by setting the --conf job parameter to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED.
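For reference, the chained value above can be sketched in Python. Glue job parameters are key/value pairs, so the first setting follows the `--conf` key directly and every further setting carries its own `--conf` prefix inside the value string. The helper name `build_conf_argument` is hypothetical; the resulting dict is the shape one would supply as job parameters (for example as `DefaultArguments` when creating the job with boto3), assuming Glue honours it as described in this answer.

```python
# Spark settings taken from the answer above; all four are set to CORRECTED.
REBASE_SETTINGS = {
    "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
}


def build_conf_argument(settings):
    """Chain several Spark settings into the value of one --conf job parameter.

    Hypothetical helper for illustration; it only builds the string.
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    # The first pair is the value of the "--conf" key itself; each later
    # pair needs its own "--conf" prefix inside the value string.
    return {"--conf": " --conf ".join(pairs)}


job_arguments = build_conf_argument(REBASE_SETTINGS)
print(job_arguments["--conf"])
```

Building the string from a dict keeps the four settings readable and avoids typos in the long one-line value.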

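This is only a workaround, though. The Glue dev team is working on a fix, although there is no ETA.

Also, this is still very buggy. For example, you cannot call .show() on a DynamicFrame; you need to call it on a DataFrame. Also, all of my jobs failed wherever I call data_frame.rdd.isEmpty(); don't ask me why.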
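The two points above can be sketched as follows. `DynamicFrame.toDF()` converts to a Spark DataFrame, on which `.show()` can be called. `is_empty` and `show_dynamic_frame` are hypothetical helpers; `is_empty` checks for rows with `DataFrame.take(1)` instead of `data_frame.rdd.isEmpty()`. Whether this sidesteps the failure described above in Glue 3.0 is an assumption, not something this answer confirms.

```python
def is_empty(data_frame):
    """Return True if the DataFrame has no rows, without going through .rdd.

    Hypothetical helper; take(1) returns a list holding at most one Row.
    """
    return len(data_frame.take(1)) == 0


def show_dynamic_frame(dynamic_frame, n=20):
    """Convert a Glue DynamicFrame to a Spark DataFrame and print up to n rows."""
    # .show() is called on the DataFrame, not on the DynamicFrame.
    data_frame = dynamic_frame.toDF()
    if is_empty(data_frame):
        print("no rows")
    else:
        data_frame.show(n)
    return data_frame
```

In a Glue script this would be called as `show_dynamic_frame(my_dynamic_frame)`, where `my_dynamic_frame` stands for whatever DynamicFrame the job has loaded.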

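Update, 24 November 2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is, however, a workaround that can be done inside the script: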

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Get the current SparkConf, which is set by Glue
conf = sc.getConf()
# Add the additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart the Spark context so the new configuration takes effect
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create the Glue context with the restarted Spark context
glueContext = GlueContext(sc)

Answer 2

This issue is addressed in the official Glue Developer Guide.

See the last bullet item in "Migrating from AWS Glue 2.0 to AWS Glue 3.0".

