AWS Glue DynamicFrames and Push Down Predicate
I'm writing an ETL script for AWS Glue that reads JSON files stored in S3. In it I create a DynamicFrame and try to use push-down predicate logic to restrict the incoming data:
# Define the data restrictor predicate
now = str(int(round(time.time() * 1000)))
now_minus_7_date = datetime.datetime.now() - datetime.timedelta(days=7)
now_minus_7 = str(int(time.mktime(now_minus_7_date.timetuple()) * 1000))
last_7_predicate = "\"timestamp BETWEEN '" + now_minus_7 + "' AND '" + now + "'\""
print("Your predicate will be :" + last_7_predicate)
The table has multiple columns and is partitioned (all strings) by RegionalCenter, Year, Month, Day, and Timestamp. The error message I get is:
An error occurred while calling o70.getDynamicFrame. User's pushdown predicate: "timestamp BETWEEN '1550254844000' AND '1550859644703'" can not be resolved against partition columns: [regionalcenter,hour,year,timestamp,month,day]
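I'm new to AWS Glue and Spark. That said, I'm confused about why a predicate on timestamp can't be resolved against the partition columns when those columns do include timestamp. I have made sure the timestamps used in the table are in milliseconds. An example of our S3 structure is: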
regionalcenter=Missouri/Year=2019/Month=2/Day=11/Hour=22/Timestamp=1549924089246
The DynamicFrame code is as follows:
# Read data from table
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
database = args['DatabaseName'],
table_name = args['TableName'],
transformation_ctx = 'dynamic_frame',
push_down_predicate = last_7_predicate)
Please let me know what else might be helpful here. Being new to this, I'm not entirely sure what else would be of value. Thanks.
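Ah, I had too many quotes. The predicate string was wrapped in an extra pair of escaped double quotes (\"), so those quote characters became part of the expression itself. Dropping them fixes it: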
last_7_predicate = "timestamp between '" + now_minus_7 + "' AND '" + now + "'"
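For anyone who wants to avoid the hand-concatenated quotes altogether, here is a minimal sketch that builds the same kind of predicate with an f-string. The helper name build_last_n_days_predicate and the MS_PER_DAY constant are made up for illustration, and the start of the window is computed by plain millisecond arithmetic rather than the datetime/mktime route in the question, so the lower bound keeps its milliseconds instead of being rounded down to a whole second.

```python
import time

# Hypothetical constant: number of milliseconds in one day.
MS_PER_DAY = 24 * 60 * 60 * 1000


def build_last_n_days_predicate(days=7, now_ms=None):
    """Return a predicate such as: timestamp BETWEEN '123' AND '456'.

    The returned string has no surrounding double quotes; only the
    two bounds are wrapped in single quotes, as in the fixed answer.
    """
    if now_ms is None:
        # Current epoch time in milliseconds, as in the question.
        now_ms = int(round(time.time() * 1000))
    start_ms = now_ms - days * MS_PER_DAY
    return f"timestamp BETWEEN '{start_ms}' AND '{now_ms}'"


# Fixed "now" taken from the error message above, so the output is reproducible.
last_7_predicate = build_last_n_days_predicate(days=7, now_ms=1550859644703)
print(last_7_predicate)
# prints: timestamp BETWEEN '1550254844703' AND '1550859644703'
```

The result can be passed as push_down_predicate exactly like the string in the answer. One thing to keep in mind: since the partition values are strings, the BETWEEN comparison is presumably done on strings, which should behave like a numeric comparison only as long as both bounds and the stored timestamps have the same number of digits (13 here).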