Cannot write Spark dataset to HBase with Spark script
I am trying to write to an HBase table using Spark, following the HBase Spark Connector example from the link. I start spark-shell with the following call:
$ spark-shell --jars /opt/cloudera/parcels/CDH/jars/hbase-spark-2.1.0-cdh6.2.1.jar,/opt/cloudera/parcels/CDH/jars/hbase-client-2.1.0-cdh6.2.1.jar
The code:
val sql = spark.sqlContext
import java.sql.Date
case class Person(name: String, email: String, birthDate: Date, height: Float)
val personDS = Seq(
  Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
  Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)).
  toDS

personDS.write.format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping",
    "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "test").
  option("hbase.spark.use.hbasecontext", false).
  option("spark.hadoop.validateOutputSpecs", false).
  save()
The exception is:
java.lang.NullPointerException
at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:139)
at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:79)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
... 49 elided
What is the cause of the exception, and how can I avoid it?
I suspect the NPE happens here because the HBaseContext should be properly initialized before the HBase-Spark connector can look up the table you are referencing in hbase:meta and create the data source. I.e., follow the "Customizing HBase Configuration" section in the link, for example:
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.HBaseConfiguration

// Constructing an HBaseContext caches it, and the connector picks up the
// latest cached instance when hbase.spark.use.hbasecontext is true (the default).
new HBaseContext(spark.sparkContext, new HBaseConfiguration())
...
There is also another way to initialize the HBaseContext:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
val conf = HBaseConfiguration.create()
// use your actual path to hbase-site.xml
conf.addResource(new Path("/etc/hbase/conf.cloudera.hbase/hbase-site.xml"))
// sc is the SparkContext provided by spark-shell
new HBaseContext(sc, conf)
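
Putting both pieces together, a minimal end-to-end spark-shell session might look like the sketch below. This assumes the test table already exists with column families c and p (e.g. created with create 'test', 'c', 'p' in hbase shell) and that the hbase-site.xml path matches your cluster; adjust both as needed.

import java.sql.Date
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// 1. Initialize the HBaseContext first, so the connector can reach hbase:meta.
val conf = HBaseConfiguration.create()
// assumed path; replace with your cluster's hbase-site.xml location
conf.addResource(new Path("/etc/hbase/conf.cloudera.hbase/hbase-site.xml"))
new HBaseContext(spark.sparkContext, conf)

// 2. Build the dataset, as in the question.
case class Person(name: String, email: String, birthDate: Date, height: Float)
val personDS = Seq(
  Person("alice", "alice@alice.com", Date.valueOf("2000-01-01"), 4.5f),
  Person("bob", "bob@bob.com", Date.valueOf("2001-10-17"), 5.1f)).
  toDS

// 3. Write. hbase.spark.use.hbasecontext is left at its default (true) so the
//    connector reuses the HBaseContext created above.
personDS.write.format("org.apache.hadoop.hbase.spark").
  option("hbase.columns.mapping",
    "name STRING :key, email STRING c:email, birthDate DATE p:birthDate, height FLOAT p:height").
  option("hbase.table", "test").
  save()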