How to improve scan performance in HBase?

Question

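I am using HBase 0.96 for analytics. I fetch data from HBase by applying single-column value filters over a row-key range defined by startRow and endRow.

Scanning 1,500,000 records takes 5-6 minutes for a single request, and it does not handle concurrent requests.
How can I improve the performance of HBase scans?

We have 3 data nodes and 2 master nodes on Amazon.

Below is my code: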

Scan s = new Scan();
s.setCaching(10000);

s.setStartRow(Bytes.toBytes(start_date));
s.setStopRow(Bytes.toBytes(end_date));

FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);

SingleColumnValueFilter adIdFilter = new SingleColumnValueFilter(
Bytes.toBytes("log"), Bytes.toBytes("ad_id"),
CompareOp.EQUAL, Bytes.toBytes(ad_id));
filters.addFilter(adIdFilter);

SingleColumnValueFilter advertiserIdFilter = new SingleColumnValueFilter(
Bytes.toBytes("log"), Bytes.toBytes("advertiser_id"),
CompareOp.EQUAL, Bytes.toBytes(adver_id));
filters.addFilter(advertiserIdFilter);

s.setFilter(filters);

ResultScanner rs = click_table.getScanner(s);

How can I use the above code in a coprocessor?

Answer 1

Try setting scan.setCaching(100000) when running the query. It specifies how many rows the region server returns to the client per RPC.
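To get a feel for what the caching value changes, here is a rough back-of-the-envelope sketch in plain Java (no HBase dependency). It assumes all 1,500,000 rows from the question are actually returned to the client, and it ignores the calls that open and close the scanner as well as any result-size limit, so treat the numbers as an approximation. The class and method names are made up for illustration.

```java
// Rough estimate of how many scanner RPCs a scan needs for a given caching value.
// Assumption: every RPC returns a full batch of `caching` rows until the rows run out.
final class ScanRpcEstimate {

    // Ceiling division: each RPC returns at most `caching` rows.
    static long rpcCount(long rowsReturned, int caching) {
        if (rowsReturned < 0 || caching <= 0) {
            throw new IllegalArgumentException("rowsReturned must be >= 0 and caching > 0");
        }
        return (rowsReturned + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_500_000L; // row count taken from the question
        for (int caching : new int[] {100, 10_000, 100_000}) {
            System.out.println("caching=" + caching + " -> about "
                    + rpcCount(rows, caching) + " RPCs");
        }
    }
}
```

Under these assumptions, going from 10000 to 100000 only reduces the round trips from about 150 to about 15, and a larger caching value also means more rows held in memory per call on both the client and the region server, so it is worth measuring rather than assuming a big win.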

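Edit: Also, depending on your network bandwidth, try tuning the batch and buffer sizes. Every application is structured differently and needs different tuning parameters, so experiment with these values on your own data.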

If performance is still the same, try fetching the data in parallel; that may help.
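One way to fetch in parallel is to split the row-key range into sub-ranges and scan each one on its own thread. The sketch below shows only that splitting and fan-out logic in plain Java. It assumes the row keys start with an ISO date such as 2014-01-01 (the question uses start_date and end_date as scan bounds, but the exact key format is not shown), and the actual scan is passed in as a placeholder function. All class and method names here are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongBiFunction;

final class ParallelRangeScan {

    // Splits [start, stop) into consecutive one-day {startKey, stopKey} pairs.
    // Assumption: row keys begin with an ISO date, so day boundaries are valid scan bounds.
    static List<String[]> splitByDay(LocalDate start, LocalDate stop) {
        List<String[]> ranges = new ArrayList<>();
        for (LocalDate d = start; d.isBefore(stop); d = d.plusDays(1)) {
            ranges.add(new String[] {d.toString(), d.plusDays(1).toString()});
        }
        return ranges;
    }

    // Runs scanOneRange on every sub-range in a fixed thread pool and sums the row counts.
    static long scanAll(List<String[]> ranges, int threads,
                        ToLongBiFunction<String, String> scanOneRange) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String[] r : ranges) {
                Callable<Long> task = () -> scanOneRange.applyAsLong(r[0], r[1]);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                try {
                    total += f.get(); // waits for this sub-range to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical one-week range, used only to exercise the code.
        List<String[]> ranges = splitByDay(LocalDate.of(2014, 1, 1), LocalDate.of(2014, 1, 8));
        // Stand-in for a real scan: pretend every day holds 10 matching rows.
        long total = scanAll(ranges, 3, (startKey, stopKey) -> 10L);
        System.out.println(ranges.size() + " sub-ranges, " + total + " rows");
    }
}
```

In a real client, the placeholder function would build its own Scan with the sub-range as start and stop row, apply the same filters, and use its own table handle, since HTable instances are not safe to share across threads. Whether this helps depends on how the sub-ranges are spread across regions and region servers.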

HTH

Answer 2

If you want to scan based on column values, the following are the best approaches:

Solr (CDH Search) https://wiki.apache.org/solr/
Hindex (coprocessor-based approach) https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/10/30/coprocessor-based-secondary-index-on-hbase

hadoop

hbase