Reading parquet files in AWS Glue
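I'm new to AWS Glue and am trying to read some Parquet objects I have in S3, but the read fails with a ClassNotFoundException. Here is what I have tried so far, based on Glue's minimal documentation: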
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession
val gc: GlueContext = new GlueContext(sc)
val spark_session : SparkSession = gc.getSparkSession
val source = gc.getSource("s3", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))
val parquetSource = source.withFormat("parquet")
parquetSource.getDynamicFrame().show(1)
The exception:
18/06/11 13:39:11 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 266, ip-172-31-8-179.eu-west-1.compute.internal, executor 16): java.lang.ClassNotFoundException: Failed to load format with name parquet
at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:28)
at com.amazonaws.services.glue.util.ClassUtils$.getClassByName(ClassUtils.scala:43)
at com.amazonaws.services.glue.util.ClassUtils$.newInstanceByName(ClassUtils.scala:54)
at com.amazonaws.services.glue.readers.DynamicRecordStreamReader$.apply(DynamicRecordReader.scala:187)
...
Caused by: java.lang.ClassNotFoundException: parquet
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.amazonaws.services.glue.util.ClassUtils$$anonfun.apply(ClassUtils.scala:25)
at com.amazonaws.services.glue.util.ClassUtils$$anonfun.apply(ClassUtils.scala:25)
at scala.util.Try$.apply(Try.scala:192)
at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:25)
... 28 more
I solved the problem. I was passing the wrong connection type to 'getSource': it should be "parquet" rather than "s3":
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession
val gc: GlueContext = new GlueContext(sc)
val spark_session : SparkSession = gc.getSparkSession
val source = gc.getSource("parquet", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))
source.getDynamicFrame().show(1)
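For anyone who would rather keep "s3" as the connection type, here is a minimal sketch of an alternative. It assumes the `getSourceWithFormat` method of `GlueContext` and the `toDF()` method of `DynamicFrame` as I understand them from the Glue Scala API documentation; I have not run it against the setup above, and the parameter names may differ in your Glue version. As in the snippets above, `sc` is the SparkContext provided by the Glue environment and "s3://path-to-parquet" is a placeholder path.

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions

// `sc` is assumed to be the SparkContext already provided by the Glue environment.
val gc: GlueContext = new GlueContext(sc)

// Keep "s3" as the connection type and pass the file format separately.
// Assumption: getSourceWithFormat accepts these named parameters.
val source = gc.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))),
  format = "parquet"
)

// Convert the DynamicFrame to a Spark DataFrame to inspect the schema.
val df = source.getDynamicFrame().toDF()
df.printSchema()
```

If this does not work in your Glue version, the `getSource("parquet", ...)` call above is the variant I actually verified.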
Hope this helps someone!