AWS Redshift driver in Zeppelin
I want to explore my data in Redshift from a Zeppelin notebook. A small EMR cluster with Spark is running behind it. I am loading Databricks' spark-redshift library:
%dep
z.reset()
z.load("com.databricks:spark-redshift_2.10:0.6.0")
Then:
import org.apache.spark.sql.DataFrame
val query = "..."
val url = "..."
val port = 5439
val table = "..."
val database = "..."
val user = "..."
val password = "..."
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", s"jdbc:redshift://${url}:$port/$database?user=$user&password=$password")
.option("query",query)
.option("tempdir", "s3n://.../tmp/data")
.load()
df.show
But I get the error:
java.lang.ClassNotFoundException: Could not load an Amazon Redshift JDBC driver; see the README for instructions on downloading and configuring the official Amazon driver
I added the option
option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver")
but that did not make it any better. I think I need to specify the JDBC driver for Redshift somewhere, the way I would pass --driver-class-path to spark-shell, but how do I do that with Zeppelin?
You can use Zeppelin's dependency-loading mechanism or, in the case of Spark, the %dep dynamic dependency loader to add an external jar with its dependencies, such as a JDBC driver. From the Zeppelin docs:
When your code requires an external library, instead of doing a download/copy/restart of Zeppelin, you can easily do the following jobs using the %dep interpreter:
- Load libraries recursively from Maven repository
- Load libraries from local filesystem
- Add additional maven repository
- Automatically add libraries to SparkCluster (you can turn this off)
The latter looks like:
%dep
// loads with all transitive dependencies from Maven repo
z.load("groupId:artifactId:version")
// or add artifact from filesystem
z.load("/path/to.jar")
and, by convention, this must be in the first paragraph of the note, before the Spark interpreter has started.
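Applied to this question, a minimal sketch could look like the following (not tested; the jar path is a placeholder). The Amazon Redshift JDBC driver is typically distributed by Amazon as a standalone jar rather than through Maven Central, so the local-filesystem form of z.load is the relevant one here:

%dep
z.reset()
// Hypothetical path: download the official Redshift JDBC jar from Amazon
// first and point z.load at wherever you saved it.
z.load("/path/to/RedshiftJDBC41.jar")
// spark-redshift itself can still be resolved from the Maven repository.
z.load("com.databricks:spark-redshift_2.10:0.6.0")

Once the driver class is on the classpath, the jdbcdriver option you already set (com.amazon.redshift.jdbc41.Driver) should stop throwing the ClassNotFoundException. If Spark has already run in the note, restart the interpreter first, since %dep only takes effect before the Spark interpreter starts.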