HBase-Spark Connector: connection to HBase established for every scan?

Question

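I'm using Cloudera's HBase-Spark connector to run intensive HBase or Bigtable scans. It works, but looking at Spark's detailed logs, it looks like the code tries to re-establish a connection to HBase on every call that processes the results of a Scan(), which I do via JavaHBaseContext.foreachPartition().

Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I rewrite it to make sure I reuse an already established connection?

Here is the full sample code that produces this behavior: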

import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

import java.util.Iterator;

public class Main
{   
    public static void main(String args[]) throws Exception
    {

        SparkConf sc = new SparkConf().setAppName(Main.class.toString()).setMaster("local");        
        Configuration hBaseConf = HBaseConfiguration.create();
        Connection hBaseConn = ConnectionFactory.createConnection(hBaseConf);

        JavaSparkContext jSPContext = new JavaSparkContext(sc);
        JavaHBaseContext hBaseContext = new JavaHBaseContext(jSPContext, hBaseConf);

        int numTries = 5;
        byte rowKey[] = "ffec939d-bb21-4525-b1ff-f3143faae2".getBytes();
        for(int i = 0; i < numTries; i++)
        {
            Scan s = new Scan(rowKey);
            FilterList fList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            fList.addFilter(new KeyOnlyFilter());
            fList.addFilter(new FirstKeyOnlyFilter());
            fList.addFilter(new PageFilter(5));
            fList.addFilter(new PrefixFilter(rowKey));
            s.setFilter(fList);
            s.setCaching(5);            

            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> scanRDD = hBaseContext
                    .hbaseRDD(hBaseConn.getTable(TableName.valueOf("FFUnits")).getName(), s);   

            hBaseContext.foreachPartition(scanRDD,  new VoidFunction<Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection>>(){
                private static final long serialVersionUID = 1L;
                public void call(Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection> t) throws Exception{
                    while (t._1().hasNext())
                        System.out.println("\tCurrent row: " + new String(t._1().next()._1.get()));
                }});
        }
    }
}

Here is the output from the Spark log. This output is repeated 5 times, once for each of the 5 iterations of the loop:

18/03/26 15:51:56 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c5f
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c5f closed
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:56 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 2044 bytes result sent to driver
18/03/26 15:51:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 300 ms on localhost (1/1)
18/03/26 15:51:56 INFO scheduler.DAGScheduler: ResultStage 3 (foreachPartition at HBaseContext.scala:98) finished in 0.301 s
18/03/26 15:51:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
18/03/26 15:51:56 INFO scheduler.DAGScheduler: Job 3 finished: foreachPartition at HBaseContext.scala:98, took 0.311925 s
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 266.5 KB, free 1391.1 KB)
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 20.7 KB, free 1411.8 KB)
18/03/26 15:51:56 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:57171 (size: 20.7 KB, free: 457.8 MB)
18/03/26 15:51:56 INFO spark.SparkContext: Created broadcast 9 from NewHadoopRDD at NewHBaseRDD.scala:25
18/03/26 15:51:56 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xc412556 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@6f930e0
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c60, negotiated timeout = 90000
18/03/26 15:51:56 INFO util.RegionSizeCalculator: Calculating region sizes for table "FFUnits".
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c60
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c60 closed
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:57 INFO spark.SparkContext: Starting job: foreachPartition at HBaseContext.scala:98
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Got job 4 (foreachPartition at HBaseContext.scala:98) with 1 output partitions
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (foreachPartition at HBaseContext.scala:98)
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Missing parents: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427), which has no missing parents
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 2.9 KB, free 1414.7 KB)
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 1719.0 B, free 1416.4 KB)
18/03/26 15:51:57 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:57171 (size: 1719.0 B, free: 457.8 MB)
18/03/26 15:51:57 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427)
18/03/26 15:51:57 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/03/26 15:51:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,ANY, 2611 bytes)
18/03/26 15:51:57 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
18/03/26 15:51:57 INFO spark.NewHBaseRDD: Input split: HBase table split(table name: FFUnits, scan: GiJmZmVjOTM5ZC1iYjIxLTQ1MjUtYjFmZi1mMzE0M2ZhYWUyKqECCilvcmcuYXBhY2hlLmhhZG9v
cC5oYmFzZS5maWx0ZXIuRmlsdGVyTGlzdBLzAQgBEjIKLG9yZy5hcGFjaGUuaGFkb29wLmhiYXNl
LmZpbHRlci5LZXlPbmx5RmlsdGVyEgIIABI1CjFvcmcuYXBhY2hlLmhhZG9vcC5oYmFzZS5maWx0
ZXIuRmlyc3RLZXlPbmx5RmlsdGVyEgASLwopb3JnLmFwYWNoZS5oYWRvb3AuaGJhc2UuZmlsdGVy
LlBhZ2VGaWx0ZXISAggFElMKK29yZy5hcGFjaGUuaGFkb29wLmhiYXNlLmZpbHRlci5QcmVmaXhG
aWx0ZXISJAoiZmZlYzkzOWQtYmIyMS00NTI1LWIxZmYtZjMxNDNmYWFlMjgBQAGIAQU=, start row: ffec939d-bb21-4525-b1ff-f3143faae2, end row: , region location: 144.240.189.35.bc.googleusercontent.com, encoded region name: 2bce3b6bf780755d19fc4b610b17cf11)
18/03/26 15:51:57 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x46ac4a0 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@5a8a2d2
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c61, negotiated timeout = 90000
18/03/26 15:51:57 INFO mapreduce.TableInputFormatBase: Input split length: 4 M bytes.
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0049424a-5cea-46cb-a6b0-7c50d6465588
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0082054a-b86a-4263-9753-025c1b0607be
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*00e21835-5dc6-4d82-8b8c-a4dcae4f14cd
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*01129620-a599-4fb7-9e2f-3492df1d06a3
    Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*035b3450-e523-4df6-a24f-11ebb29050f7

My hbase-site.xml file looks like this:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hbase-3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>timeout</name>
    <value>5000</value>
  </property>
</configuration>

I'm using the following versions:

Spark v 1.6.2
HBase 1.3.1
Spark-HBase 1.2.0-cdh5.14.0

Thanks for your help and suggestions!

Answer 1

This is a common problem. The cost of creating a connection can dwarf the actual work you are doing.

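In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. This will significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.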
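As a rough illustration, the setting could be expressed as a property entry like the one below. This assumes the Bigtable HBase client reads its options from the same hbase-site.xml shown in the question; that placement is an assumption, and only the property name and value come from the text above. The entry would go inside the existing <configuration> element.

<property>
  <name>google.bigtable.use.cached.data.channel.pool</name>
  <value>true</value>
</property>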

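I don't know of a similar construct in HBase, but one approach would be to write an implementation of Connection that creates a cached Connection under the hood. You would have to set hbase.client.connection.impl to the new class.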
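The caching idea can be sketched without any HBase classes. The example below is a minimal illustration, not the connector's actual mechanism: FakeConnection, SharedHolder, and SharedConnectionDemo are hypothetical names, and FakeConnection merely stands in for an expensive client connection. It shows the one property that matters here, namely that five "scans" in a row open the underlying connection only once.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SharedConnectionDemo {

    /** Hypothetical stand-in for an expensive connection; not an HBase class. */
    static final class FakeConnection implements AutoCloseable {
        // Counts how many times a connection has actually been opened.
        static final AtomicInteger OPENED = new AtomicInteger();
        private volatile boolean closed = false;

        FakeConnection() {
            OPENED.incrementAndGet();
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Creates the resource lazily and hands the same instance to every caller. */
    static final class SharedHolder<T extends AutoCloseable> {
        private final Supplier<T> factory;
        private T instance;
        private int users;

        SharedHolder(Supplier<T> factory) {
            this.factory = factory;
        }

        /** Returns the cached instance, creating it on first use only. */
        synchronized T acquire() {
            if (instance == null) {
                instance = factory.get();
            }
            users++;
            return instance;
        }

        /** Marks one caller as done; the resource stays open for the next caller. */
        synchronized void release() {
            if (users == 0) {
                throw new IllegalStateException("release() without matching acquire()");
            }
            users--;
        }

        /** Closes the cached instance, for example at application shutdown. */
        synchronized void shutdown() {
            if (instance != null) {
                try {
                    instance.close();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                instance = null;
            }
        }
    }

    public static void main(String[] args) {
        SharedHolder<FakeConnection> holder = new SharedHolder<>(FakeConnection::new);
        // Mirrors the five loop iterations in the question.
        for (int i = 0; i < 5; i++) {
            FakeConnection conn = holder.acquire();
            try {
                // A real job would run its scan with conn here.
            } finally {
                holder.release();
            }
        }
        System.out.println("connections opened: " + FakeConnection.OPENED.get());
        holder.shutdown();
    }
}
```

In a real job, a wrapper along these lines would have to live once per executor JVM (for example behind a static field), since each executor needs its own connection. Whether it can be plugged in through hbase.client.connection.impl depends on what the HBase client version in use expects of that class.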
