How to set up a local development environment for Scala Spark ETL to run in AWS Glue?
I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I can't find the libraries required to build the GlueApp skeleton that AWS generates.
aws-java-sdk-glue doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere, but perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template Scala code from AWS:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That might be possible, since the Glue Python library uses Py4J.
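For the classpath part, here is a minimal sbt sketch of what that could look like, assuming the Glue classes can be obtained as a plain jar (the directory name below is illustrative, not an official convention):

// sbt already treats every jar under lib/ as an unmanaged dependency,
// so dropping a Glue jar into lib/ puts it on the compile classpath.
// To keep Glue jars in a separate directory instead:
unmanagedBase := baseDirectory.value / "glue-jars"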
Unfortunately, there is no library available for the Glue Scala API. I have contacted Amazon support and they are aware of the problem, but they have not given any ETA for delivering the API jar.
As a workaround, you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar
See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
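For example, the jar can be pulled down with the AWS CLI and dropped into the sbt project's lib/ directory (assuming the CLI is configured with credentials that can read that bucket):

aws s3 cp s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar lib/glue-assembly.jar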
@Frederic gave a very helpful hint to get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.
Unfortunately, that version of glue-assembly.jar is outdated and brings in Spark at version 2.1.
That's fine if you only use backwards-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features), you can fetch the appropriate jar from a Glue dev-endpoint, where it lives under /usr/share/aws/glue/etl/jars/glue-assembly.jar.
If you have a dev endpoint named my-dev-endpoint, you can copy the current jar from it:
export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`
scp -i dev-endpoint-private-key \
glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
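One caveat: Glue provides these classes (and Spark itself) at runtime, so the copied jar should only be used for compiling locally and must not be bundled into the artifact you deploy. A hedged sbt sketch of that idea, marking Spark as provided (an unmanaged Glue jar in lib/ would likewise need excluding from any sbt-assembly output):

// Spark is supplied by the Glue runtime; compile against it but don't ship it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1" % "provided"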
Update: this is supported by AWS now, see
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
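With that official support, the Glue Scala library can be consumed as a regular managed dependency. A sketch of the sbt wiring, assuming the Maven coordinates and repository URL described on the linked page (verify them there, since versions track Glue releases):

// AWS publishes the Glue ETL artifacts to its own Maven repository;
// the coordinates below are an assumption based on the linked docs.
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0" % "provided"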