How to connect to a remote Hive server from Spark

Question

I am running Spark locally and want to access Hive tables that live in a remote Hadoop cluster.

I am able to access the Hive tables by launching beeline under SPARK_HOME:

[ml@master spark-2.0.0]$./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>

How can I access the remote Hive tables programmatically from Spark?

Answer 1

JDBC is not required

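Spark connects directly to the Hive metastore, not through HiveServer2. To configure it:

Put hive-site.xml on your classpath and set hive.metastore.uris to the location where your Hive metastore is hosted.
Import org.apache.spark.sql.hive.HiveContext, as it can run SQL queries over Hive tables. (On Spark 2.x, HiveContext is deprecated in favor of SparkSession with enableHiveSupport().)
Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Run sqlContext.sql("show tables") to verify that it works.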
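
As a rough illustration of the first step, the sketch below renders a minimal hive-site.xml that contains only hive.metastore.uris. The helper name hive_site_xml and the host metastore-host are placeholders invented for this example, 9083 is the usual metastore Thrift port, and a real file normally carries more settings. Spark typically reads the file from $SPARK_HOME/conf.

```python
import xml.etree.ElementTree as ET


def hive_site_xml(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# "metastore-host" is a placeholder; replace it with the host running the metastore.
xml_text = hive_site_xml({"hive.metastore.uris": "thrift://metastore-host:9083"})
print(xml_text)
```

Saving the printed text as hive-site.xml in Spark's conf directory should be enough for a quick connectivity check with "show tables".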

SparkSQL on Hive tables

Conclusion: if you must go the JDBC way

Note that beeline also connects through JDBC. You can see this in your log:

[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000

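So please see this interesting article, which lists three methods:

Method 1: Use JDBC
Method 2: Use Spark JdbcRDD with the HiveServer2 JDBC driver
Method 3: Fetch the dataset on the client side, then create the RDD manually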

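Currently the HiveServer2 driver does not allow us to use the "Sparkling" methods 1 and 2, so we can rely only on method 3.

Below is a sample code snippet that achieves this.

It loads data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.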

import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password must be defined beforehand; the URL has the same
// jdbc:hive2://host:port form used in the beeline session.
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
                   .executeQuery("SELECT * FROM stats_201512301914")
// Collect every row on the driver before handing the data to Spark.
val fetchedRes = MutableList[StatsRec]()
while(res.next()) {
  val rec = StatsRec(res.getString("first_name"),
     res.getString("last_name"),
     Timestamp.valueOf(res.getString("action_dtm")),
     res.getLong("size"),
     res.getLong("size_p"),
     res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()
// sc is the SparkContext (predefined in spark-shell).
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check loaded data:

println(rddStatsDelta.count)
// take(10) fetches only ten rows instead of collecting the whole RDD first.
rddStatsDelta.take(10).foreach(println)

Answer 2

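After providing the hive-site.xml configuration to Spark and starting the Hive metastore service, two things need to be configured in the Spark session when connecting to Hive:

1. Since Spark SQL connects to the Hive metastore using Thrift, we need to provide the Thrift server URI when creating the Spark session.
2. The Hive metastore warehouse, which is the directory where Spark SQL persists tables. Use the property 'spark.sql.warehouse.dir', which corresponds to 'hive.metastore.warehouse.dir' (the latter is deprecated since Spark 2.0).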

Something like:

    import org.apache.spark.sql.SparkSession;
    // Pass the settings to the builder so they are in place before the session is created.
    SparkSession spark = SparkSession.builder().appName("Spark_SQL_5_Save To Hive")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport().getOrCreate();
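
The two settings described in this answer can also be kept together in one small helper. The PySpark sketch below does that. The function names hive_session_options and build_session are made up for illustration, the default port 9083 and the warehouse path are taken from the snippet above, and pyspark is assumed to be installed wherever build_session is actually called.

```python
def hive_session_options(metastore_host, metastore_port=9083,
                         warehouse_dir="/user/hive/warehouse"):
    """Return the two settings needed to reach a remote Hive metastore."""
    return {
        # Thrift URI of the Hive metastore service.
        "hive.metastore.uris": f"thrift://{metastore_host}:{metastore_port}",
        # Directory where Spark SQL stores managed tables.
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(app_name, options):
    """Create a Hive-enabled SparkSession with the given options applied."""
    # Imported lazily so the options helper can be used without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Example usage, with remote_hive standing in for the metastore host:
# spark = build_session("hive-demo", hive_session_options("remote_hive"))
# spark.sql("show tables").show()
```

Keeping the options in a plain dictionary makes it easy to print or log exactly what the session was started with.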

Hope this helps!

Answer 3

According to the documentation:

Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.

So the deprecation concerns only the warehouse directory. The metastore address is still set through hive.metastore.uris in the SparkSession; spark.sql.uris is not a Spark property:

    from pyspark.sql import SparkSession
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("hive.metastore.uris", "thrift://<remote_ip>:9083") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sql("show tables").show()
