Does AWS Glue handle new data by default?

Question

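Take a look at this example. It reads data from an S3 directory and then writes it back to an S3 folder. But what if I add data and rerun the job? Am I right that AWS Glue reads and writes all the data again? Or does it detect (how?) only the new data and write only that?

By the way, if I read from partitioned data, do I have to specify the "newly arrived" partitions myself?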

Answer 1

From what I can see in that example, they read from a crawled location in S3 and then replace a file each time, completely reloading all the data.
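To make the difference concrete, here is a toy model in plain Python, with no AWS calls. It shows why a rerun with no saved state sees everything again, and what changes once a "last processed" timestamp is remembered between runs. The `S3Object` class, the function names and the integer timestamps are all made up for illustration; this is not how Glue is implemented internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Object:
    key: str
    last_modified: int  # stand-in for the object's last-modified timestamp


def full_reload(objects):
    """With no saved state, every run sees every object."""
    return [o.key for o in objects]


def incremental(objects, bookmark):
    """Return keys newer than the bookmark, plus the bookmark to save on commit."""
    new = [o for o in objects if o.last_modified > bookmark]
    next_bookmark = max((o.last_modified for o in new), default=bookmark)
    return [o.key for o in new], next_bookmark


objects = [S3Object("data/a.json", 100), S3Object("data/b.json", 200)]
first_run, bookmark = incremental(objects, bookmark=0)

# New data arrives, then the job is rerun with the saved bookmark.
objects.append(S3Object("data/c.json", 300))
second_run, bookmark = incremental(objects, bookmark)
```

The second run only picks up `data/c.json`, while `full_reload(objects)` would return all three keys again.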

To process only new files, you need to enable Bookmarks for the job and make sure you commit the job, by doing the following:

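import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())

# Instantiate your job object to later commit
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the table. If you enable Bookmarks and commit at the end, this will only
# give you new files. Bookmarks need a transformation_ctx to track state;
# the name "read_source" is arbitrary. db_name and tbl_name are placeholders
# for your own catalog database and table.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=tbl_name,
    transformation_ctx="read_source",
)

result_dynamic_frame = dynamic_frame  # do some operations here

# Append operation to create new parquet files from new data
(
    result_dynamic_frame.toDF()
    .write.mode("append")
    .parquet("s3://bucket/prefix/permit-inspections.parquet")
)

# Commit my job so next time we read, only new files will come in
job.commit()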
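
Bookmarks themselves are switched on outside the script, through the `--job-bookmark-option` job parameter (or the matching setting in the console). As a rough sketch, assuming you start the job with boto3, it could look like the following. The job name `my-etl-job` and both helper functions are hypothetical, and the second one needs boto3 plus valid AWS credentials to actually run.

```python
def bookmark_run_kwargs(job_name: str) -> dict:
    """Build keyword arguments for a Glue job run with bookmarks enabled."""
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
    }


def start_run_with_bookmarks(job_name: str) -> str:
    """Start the job run and return its run id (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(**bookmark_run_kwargs(job_name))
    return response["JobRunId"]


kwargs = bookmark_run_kwargs("my-etl-job")  # hypothetical job name
```

Calling `start_run_with_bookmarks("my-etl-job")` would then submit the run; without the bookmark option enabled, `job.commit()` has no saved state to update and each run reads everything again.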

Hope this helps.

amazon-s3

amazon-web-services

aws-glue