Does AWS Glue handle new data by default?
Take a look at this example. It reads data from an S3 directory and writes it back to an S3 folder. But what if I add data and re-run the job? Am I right that AWS Glue reads and writes all the data again? Or does it detect (and how?) only the new data and write just that?
By the way, if I read from partitioned data, do I have to specify the newly arrived partitions myself?
From what I can see in that example, they are reading from a crawled location in S3 and then replacing a file, fully reloading all the data every time.
To process only new files, you need to enable Bookmarks for the job and make sure you commit the job, by doing the following:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
# Instantiate your job object to later commit
job = Job(glue_context)
job.init(args['JOB_NAME'], args)
# Read the table; if you enable Bookmarks and commit at the end,
# this will only give you new files
# (db_name and tbl_name are your Data Catalog database and table)
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=db_name, table_name=tbl_name)
result_dynamic_frame = dynamic_frame  # do some operations
# Append operation to create new parquet files from the new data
result_dynamic_frame.toDF().write \
    .mode("append") \
    .parquet("s3://bucket/prefix/permit-inspections.parquet")
# Commit my job so next time we read, only new files will come in
job.commit()
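A note on enabling Bookmarks: they are switched on as a job parameter, not inside the script. As a minimal sketch (the job name here is a placeholder), you can pass the standard --job-bookmark-option argument when starting a run with boto3:

import boto3

glue = boto3.client("glue")
# "my-bookmark-job" is a placeholder job name; --job-bookmark-option
# with the value "job-bookmark-enable" turns Bookmarks on for this run
glue.start_job_run(
    JobName="my-bookmark-job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

Regarding your partition question: with Bookmarks on, Glue tracks which source files it has already processed, so files landing in newly arrived partitions should be picked up on the next run without you listing them yourself, provided the Data Catalog already knows about those partitions (e.g. after a crawler run).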
Hope this helps.