How does an AWS Glue ETL job retrieve data?

Question

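I'm new to AWS Glue and I don't understand how the ETL job gathers data. I used a crawler to generate my table schema from some files in an S3 bucket, and then examined the auto-generated script in the ETL job, which is here (slightly modified):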

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")

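When I run this job, it successfully picks up my data from the bucket my crawler used to generate the table schema, and it puts the data in my destination S3 bucket as expected.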

My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that the crawler generated, but from this doc:

Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.

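If the table only contains metadata, how does the ETL job retrieve the files from the data store (in my case, an S3 bucket)? I'm asking mainly because I'd like to modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.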

Answer 1

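Take a closer look at the AWS Glue Data Catalog. It has tables that reside under databases. If you click on one of these tables, you will see its metadata, which shows the S3 folder the table currently points to as a result of the crawler run.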
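The same location can also be read programmatically. The following is a minimal sketch, assuming boto3 is available and credentials are configured; the helper name `get_table_location` is made up for illustration, while `get_table` and the `StorageDescriptor.Location` field are part of the Glue API.

```python
def get_table_location(glue_client, database, table_name):
    """Return the data store location recorded in the Data Catalog for a table.

    glue_client is expected to be a boto3 Glue client (boto3.client("glue")).
    """
    response = glue_client.get_table(DatabaseName=database, Name=table_name)
    # The catalog holds only metadata; Location is the pointer to the real data.
    return response["Table"]["StorageDescriptor"]["Location"]


# Usage (assumes AWS credentials and a default region are configured):
#   import boto3
#   glue = boto3.client("glue")
#   print(get_table_location(glue, "mydatabase", "mytablename"))
```

For a table created by a crawler over S3 files, the returned value should be the S3 path that the crawler scanned.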

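You can also create tables manually over structured files in S3, by adding a table through the Data Catalog option in the Glue console and pointing it at your S3 location.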

Another way is to use the AWS Athena console to create tables that point to an S3 location. You would use a regular CREATE TABLE script whose LOCATION field holds your S3 location.
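A table can also be registered from code rather than from either console. This is only a sketch under assumptions: the helper name `build_json_table_input` and the bucket `s3://myotherbucket/input/` are hypothetical, and the format classes shown are the ones commonly used for JSON files, so adjust them for other formats. The resulting dict is meant to be passed as `TableInput` to the boto3 Glue client's `create_table` call.

```python
def build_json_table_input(table_name, s3_location, columns):
    """Build a TableInput dict describing JSON files stored under s3_location.

    columns is a list of (name, hive_type) pairs, e.g. [("data", "string")].
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": hive_type} for name, hive_type in columns],
            # Location is what later tells an ETL job where the files live.
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


# Usage (assumes AWS credentials and an existing database "mydatabase"):
#   import boto3
#   table_input = build_json_table_input(
#       "myothertable", "s3://myotherbucket/input/", [("data", "string")]
#   )
#   boto3.client("glue").create_table(DatabaseName="mydatabase", TableInput=table_input)
```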

Answer 2

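The main thing to understand is this: the Glue Data Catalog (databases and tables) is always in sync with Athena, because Athena uses the Data Catalog as its metastore. Athena is a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables and databases from either the Glue console or the Athena query console.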

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")

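The line of Glue Spark code above is where the magic happens: it creates the initial DynamicFrame from the Glue Data Catalog source table. Besides the metadata, schema and table properties, the table also has a Location that points to your data store (the S3 location) where your data resides.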

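After the applymapping is done, this part of the code (the data sink) is what actually loads the data into your target (a cluster, a database or, in your script, the S3 output bucket).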

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
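For the case in the question (identically structured files in a different bucket, with no new crawler), one option is to skip the catalog lookup and hand the S3 path to the reader directly. This is a minimal sketch, not the only approach: the helper name `read_json_from_s3` and the path `s3://myotherbucket/input/` are hypothetical, and it assumes the files are JSON. `create_dynamic_frame.from_options` is the reading counterpart of the `write_dynamic_frame.from_options` call used for the sink.

```python
def read_json_from_s3(glue_context, s3_path, ctx_name="datasource_direct"):
    """Create a DynamicFrame straight from an S3 path, without a catalog table.

    glue_context is expected to be an awsglue.context.GlueContext, as created
    in the job script above.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        # "paths" takes a list of S3 prefixes; "recurse" also reads subfolders.
        connection_options={"paths": [s3_path], "recurse": True},
        format="json",
        transformation_ctx=ctx_name,
    )


# Usage inside the Glue job, in place of the from_catalog line:
#   datasource0 = read_json_from_s3(glueContext, "s3://myotherbucket/input/")
```

Without a catalog table the schema is inferred from the data at read time, so the ApplyMapping step still needs to match the actual field names and types.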

amazon-s3

amazon-web-services

aws-glue