How to access HBase from Spark (Scala)? Is there a clearly defined Scala API?

Question

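How can I access HBase from Spark in Scala? Is there a clearly defined Scala API? I am looking for something at the DataFrame level rather than the RDD level.

There are many options on the web, such as the Apache HBase connector, SparkOnHBase, and others.

It would be good to know, and use, the approach that is most common in industry.

Thanks for your help.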

Answer 1

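The Hortonworks Spark-HBase connector (SHC) is widely used to access HBase from Spark. It provides APIs at both the low-level RDD level and the DataFrame level.

The connector requires you to define a schema (a catalog) for the HBase table. Below is an example catalog for an HBase table named table1, with key as the row key and a number of columns (col1-col8). Note that the row key must also be defined explicitly as a column (col0), and that column uses the special column family rowkey.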

def catalog = s"""{
        |"table":{"namespace":"default", "name":"table1"},
        |"rowkey":"key",
        |"columns":{
          |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
          |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
          |"col2":{"cf":"cf2", "col":"col2", "type":"double"},
          |"col3":{"cf":"cf3", "col":"col3", "type":"float"},
          |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
          |"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
          |"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
          |"col7":{"cf":"cf7", "col":"col7", "type":"string"},
          |"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
        |}
      |}""".stripMargin
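
When a table has many columns, writing the catalog JSON by hand can get error-prone. The following is a minimal sketch, in plain Scala with no Spark or SHC dependency, of generating a catalog string of the same shape from a list of column specs. HBaseColumn, CatalogBuilder and CatalogExample are hypothetical names introduced here for illustration, not part of the connector, and the check assumes a single-field row key.

```scala
// Hypothetical helper (not part of SHC): describes one catalog column.
final case class HBaseColumn(name: String, cf: String, col: String, dataType: String)

object CatalogBuilder {
  // Builds a catalog JSON string in the same shape as the hand-written catalog.
  // Assumption: a single-field row key, declared as a column with cf "rowkey".
  def build(namespace: String, table: String, rowKey: String, columns: Seq[HBaseColumn]): String = {
    require(
      columns.exists(c => c.cf == "rowkey" && c.col == rowKey),
      s"row key '$rowKey' must be declared as a column with cf 'rowkey'"
    )
    // One JSON entry per column, e.g. "col1":{"cf":"cf1", "col":"col1", "type":"boolean"}
    val entries = columns.map { c =>
      s"""    "${c.name}":{"cf":"${c.cf}", "col":"${c.col}", "type":"${c.dataType}"}"""
    }
    Seq(
      "{",
      s"""  "table":{"namespace":"$namespace", "name":"$table"},""",
      s"""  "rowkey":"$rowKey",""",
      """  "columns":{""",
      entries.mkString(",\n"),
      "  }",
      "}"
    ).mkString("\n")
  }
}

object CatalogExample {
  // Generates a two-column version of the table1 catalog.
  val generatedCatalog: String = CatalogBuilder.build(
    namespace = "default",
    table = "table1",
    rowKey = "key",
    columns = Seq(
      HBaseColumn("col0", "rowkey", "key", "string"),
      HBaseColumn("col1", "cf1", "col1", "boolean")
    )
  )
}
```

The resulting string can be passed wherever the hand-written catalog is used. The require call fails fast if the row key column is missing, which is the mistake the note about col0 warns against.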

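Read the HBase table as a DataFrame:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Load the HBase table described by `catalog` as a DataFrame
val df = spark
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()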
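
Since load() returns an ordinary Spark DataFrame, the standard DataFrame operations apply to it. Here is a minimal sketch, assuming the table1 catalog (where col0 is the string row key and col1 is a mapped column). HBaseDataFrameQuery and rowsUpTo are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object HBaseDataFrameQuery {
  // Keeps rows whose row key column (col0 in the catalog) is <= maxKey,
  // compared as strings, and projects two of the mapped columns.
  def rowsUpTo(df: DataFrame, maxKey: String): DataFrame =
    df.filter(col("col0") <= maxKey)
      .select("col0", "col1")
}
```

For example, HBaseDataFrameQuery.rowsUpTo(df, "row005").show() would print the matching rows, assuming row keys of that form exist in the table.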

Write a DataFrame to the HBase table:

// newTable -> "5" asks the connector to create the table with 5 regions before saving
df.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

More details: https://github.com/hortonworks-spark/shc

hbase

scala

apache-spark