Solutions to put different values for a row-key but the same timestamps in hbase?

Question

I am new to HBase. I ran into a problem when bulk loading data from a text file into HBase. Suppose I have the following table:

Key_id | f1:c1 | f2:c2
row1     'a'     'b'
row1     'x'     'y'

When I parse the 2 records and put them into HBase at the same time (same timestamp), only the version {row1 'x' 'y'} is kept. This is the explanation:

When you put data into HBase, a timestamp is required. The timestamp can be generated automatically by the RegionServer or can be supplied by you. The timestamp must be unique per version of a given cell, because the timestamp identifies the version. To modify a previous version of a cell, for instance, you would issue a Put with a different value for the data itself, but the same timestamp.
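
The rule quoted above can be illustrated with a small toy model. This is only a sketch: `CellVersions` is a hypothetical class that stores one cell's versions in a map keyed by timestamp. It is not the HBase API, but it behaves roughly the way the quote describes, so a second put with the same timestamp replaces the first value instead of adding a version.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model only: one cell's versions, keyed by timestamp (newest first).
// This is NOT the HBase API; it just mirrors the versioning rule quoted above.
class CellVersions {
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    // A put with an existing timestamp replaces that version's value.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    int versionCount() {
        return versions.size();
    }

    // The value a default read would return: the one with the newest timestamp.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellVersions sameTs = new CellVersions();
        sameTs.put(1000L, "a");
        sameTs.put(1000L, "x"); // same timestamp: "a" is replaced
        System.out.println(sameTs.versionCount() + " version(s), latest = " + sameTs.latest());

        CellVersions distinctTs = new CellVersions();
        distinctTs.put(1000L, "a");
        distinctTs.put(1001L, "x"); // different timestamp: a second version is kept
        System.out.println(distinctTs.versionCount() + " version(s), latest = " + distinctTs.latest());
    }
}
```

In the question's scenario both records for row1 arrive with the same timestamp, which corresponds to the first case in `main`.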

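I am considering specifying the timestamps myself, but I don't know how to set them automatically for a bulk load. Would that affect load performance? I need the fastest and safest import process for big data.

I tried parsing each record and putting it into the table one at a time, but that is very, very slow... So my other question is: how many records (or how much data) should be batched before putting them into HBase? (I wrote a simple Java program to do the puts, and it is much slower than importing with the ImportTsv tool from the command line. I don't know what batch size that tool uses.)

Thanks a lot for your advice!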

Answer 1

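If you need to overwrite records, you can configure the HBase table to keep only one version.

This page explains how to bulk load into HBase at the highest possible speed: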

How to use hbase bulk loading and why

Answer 2

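Q1: HBase uses timestamps to maintain versions. If you don't supply one, it uses the default provided by the HBase system.

If you have such a requirement, you can also set a custom timestamp in the Put request. It does not affect performance.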
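
One possible way to generate such custom timestamps during a load is sketched below. It assumes a single-process loader and uses a hypothetical helper, `RowTimestampAssigner` (not part of HBase), that hands out a base timestamp plus a per-row-key counter, so that records sharing a row key never share a timestamp.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of HBase): hands out a distinct timestamp
// for every record that shares a row key within one load.
class RowTimestampAssigner {
    private final long baseTimestamp;
    private final Map<String, Long> recordsSeen = new HashMap<>();

    RowTimestampAssigner(long baseTimestamp) {
        this.baseTimestamp = baseTimestamp;
    }

    // The first record for a row key gets baseTimestamp, the next one
    // baseTimestamp + 1, and so on, so later records read as newer versions.
    long next(String rowKey) {
        long offset = recordsSeen.merge(rowKey, 1L, Long::sum) - 1;
        return baseTimestamp + offset;
    }

    public static void main(String[] args) {
        RowTimestampAssigner assigner = new RowTimestampAssigner(System.currentTimeMillis());
        long first = assigner.next("row1");  // for {row1 'a' 'b'}
        long second = assigner.next("row1"); // for {row1 'x' 'y'}
        System.out.println("timestamps differ by " + (second - first));
    }
}
```

The returned value would then be passed as the explicit timestamp of the Put, for example through the timestamp-taking overload of `Put.addColumn` in recent client versions. Both values are only retained if the column family is configured to keep more than one version.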

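Q2: You can do it in two ways.

A simple Java client with a batching technique, as shown below.
MapReduce ImportTsv (which is also a batching client)

Example: #1, a simple Java client with a batching technique.

I used this approach, building a list of HBase Put objects in batches of 100000 records, while parsing JSON (similar to your standalone CSV client).

Below is the code snippet I used to achieve this. (The same can be done when parsing other formats.)

You will probably need to call this method in two places:

1) For each full batch of 100000 records.

2) For the remainder, when the last batch holds fewer than 100000 records.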

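    import java.util.ArrayList;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    // HBaseConnection, getTable and LOG come from the surrounding project.
    public void addRecord(final ArrayList<Put> puts, final String tableName) throws Exception {
        if (puts == null || puts.isEmpty()) {
            return; // nothing to write
        }
        HTable table = null;
        try {
            table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
            table.put(puts); // one batched write for the whole list
            LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
        } catch (final Throwable e) {
            e.printStackTrace(); // failures are only printed; the batch is dropped
        } finally {
            LOG.info("Processed ---> " + puts.size());
            puts.clear(); // the caller can reuse the list for the next batch
            if (table != null) {
                table.close(); // release the table's resources
            }
        }
    }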
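
The two call sites described above (full batches, then the remainder) can be sketched without any HBase dependency. This is a minimal illustration, not the answerer's actual driver code: `BatchWriter` and its `sink` callback are hypothetical names, and the sink stands in for a method such as `addRecord`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the two call sites: flush every full batch, then flush the remainder.
// The sink stands in for a method like addRecord; all names here are hypothetical.
class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchWriter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Call site 1: hand over a batch as soon as it is full.
    void add(T record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Call site 2: call once after the input is exhausted to write the remainder.
    void flush() {
        if (!pending.isEmpty()) {
            sink.accept(new ArrayList<>(pending)); // copy, because pending is reused
            pending.clear();
        }
    }

    public static void main(String[] args) {
        BatchWriter<String> writer = new BatchWriter<>(100000,
                batch -> System.out.println("writing " + batch.size() + " records"));
        for (int i = 0; i < 250000; i++) {
            writer.add("record-" + i);
        }
        writer.flush(); // remainder: fewer than 100000 records
    }
}
```

Forgetting the final `flush()` is the usual mistake with this pattern: the last partial batch would never reach the table.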

Note: the batch size is controlled internally by hbase.client.write.buffer, set as shown below in one of the config XML files.

<property>
    <name>hbase.client.write.buffer</name>
    <value>20971520</value> <!-- 20971520 bytes is 20 MB; 2 MB would be 2097152 -->
</property>

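The default value is 2 MB (2097152 bytes). Once the buffer fills up, it is flushed, and the buffered puts are actually inserted into your table.

Furthermore, whether you use the MapReduce client or the standalone client with the batch technique, batching is controlled by the buffer property above.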
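
As a rough aid for picking a buffer size, the arithmetic can be checked with a few lines of Java. This is back-of-the-envelope only: the 200-byte average record size is an assumption made for illustration, and real Put objects carry additional overhead, so the counts are rough upper bounds.

```java
// Back-of-the-envelope arithmetic only; real Put objects carry extra overhead,
// so treat the results as rough upper bounds.
class WriteBufferMath {
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    // Roughly how many puts of a given average size fit before a flush.
    static long putsPerFlush(long bufferBytes, long averagePutBytes) {
        return bufferBytes / averagePutBytes;
    }

    public static void main(String[] args) {
        long configured = 20971520L; // value from the XML snippet above
        long twoMegabytes = 2097152L; // the 2 MB default
        long averagePutBytes = 200L;  // assumed record size, purely illustrative
        System.out.println(toMegabytes(configured) + " MB, about "
                + putsPerFlush(configured, averagePutBytes) + " puts per flush");
        System.out.println(toMegabytes(twoMegabytes) + " MB, about "
                + putsPerFlush(twoMegabytes, averagePutBytes) + " puts per flush");
    }
}
```

Note that 20971520 bytes works out to 20 MB, ten times the 2 MB default.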


hadoop

timestamp

hbase

versions

bulk-load