How can I pre-split in HBase?

I am storing data in HBase across 5 region servers, using the MD5 hash of a URL as the row key. Currently all the data is landing on a single region server, so I want to pre-split the regions so that data is distributed evenly across all the region servers. I want to split on the first character of the row key. Since the first character is a hex digit from 0 to f (16 values), rows whose keys start with 0 up to 3 would go to the first region server, 3 up to 6 to the second, 6 up to 9 to the third, a up to d to the fourth, and d up to f to the fifth. How can I do this?

You can provide a SPLITS property when creating the table.

create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}

The 4 split points will generate 5 regions.
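To see why those 4 split points give 5 regions, here is a simplified sketch of how a row key is routed to a region given the split points (an illustration of the lexicographic routing, not HBase's actual internals):

```python
import bisect
import hashlib

def region_for_rowkey(rowkey: str, splits: list) -> int:
    """Return the index of the region that would hold this row key.
    Keys below the first split go to region 0, keys at or above the
    last split go to the last region (len(splits) regions + 1)."""
    return bisect.bisect_right(splits, rowkey)

splits = ['3', '6', '9', 'd']
# An MD5-hashed row key, as in the question.
rowkey = hashlib.md5(b"http://example.com").hexdigest()
print(region_for_rowkey(rowkey, splits))
```

Any hex-prefixed key falls into exactly one of the 5 ranges, which is why uniformly distributed MD5 prefixes spread rows evenly across the regions.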

Note that HBase's DefaultLoadBalancer does not guarantee a 100% even distribution across region servers: a single region server may host multiple regions from the same table.

For more information on how it works, take a look at this:

public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)

Generate a global load balancing plan according to the specified map of server information to the most loaded regions of each server. The load balancing invariant is that all servers are within 1 region of the average number of regions per server. If the average is an integer number, all servers will be balanced to the average. Otherwise, all servers will have either floor(average) or ceiling(average) regions.

HBASE-3609 modeled regionsToMove using Guava's MinMaxPriorityQueue so that we can fetch from both ends of the queue. At the beginning, we check whether there was an empty region server just discovered by Master. If so, we alternately choose new / old regions from head / tail of regionsToMove, respectively. This alternation avoids clustering young regions on the newly discovered region server. Otherwise, we choose new regions from head of regionsToMove.

Another improvement from HBASE-3609 is that we assign regions from regionsToMove to underloaded servers in round-robin fashion. Previously one underloaded server would be filled before we move onto the next underloaded server, leading to clustering of young regions. Finally, we randomly shuffle underloaded servers so that they receive offloaded regions relatively evenly across calls to balanceCluster().

The algorithm is currently implemented as such:

  1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
  2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.
  3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN. If so we end up with a number of regions required to do so, neededRegions. It is also possible that we were able to fill each underloaded but ended up with regions that were unassigned from overloaded servers but that still do not have assignment. If neither of these conditions hold (no regions needed to fill the underloaded servers, no regions leftover from overloaded servers), we are done and return. Otherwise we handle these cases below.
  4. If neededRegions is non-zero (still have underloaded servers), we iterate the most loaded servers again, shedding a single server from each (this brings them from having MAX regions to having MIN regions).
  5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers filling each to MIN. If we still have more regions that need assignment, again iterate the least loaded servers, this time giving each one (filling them to MAX) until we run out.
  6. All servers will now either host MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal number of regions possible are moved.
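The floor/ceiling invariant from step 1 can be sketched like this (a toy illustration of the MIN/MAX math only, not the real balancer):

```python
import math

def balanced_counts(total_regions: int, num_servers: int) -> list:
    """Compute how many regions each server should host so that every
    server ends up with either MIN = floor(average) or MAX = ceil(average)
    regions, as the balancer invariant requires."""
    lo = total_regions // num_servers              # MIN = floor(average)
    hi = math.ceil(total_regions / num_servers)    # MAX = ceil(average)
    extra = total_regions - lo * num_servers       # servers that get MAX
    return [hi] * extra + [lo] * (num_servers - extra)

print(balanced_counts(13, 5))  # [3, 3, 3, 2, 2]
```

With 13 regions on 5 servers, three servers host ceil(13/5) = 3 regions and two host floor(13/5) = 2, so every server is within 1 region of the average.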

TODO: We can at-most reassign the number of regions away from a particular server to be how many they report as most loaded. Should we just keep all assignment in memory? Any objections? Does this mean we need HeapSize on HMaster? Or just careful monitor? (current thinking is we will hold all assignments in memory)

If you have already stored all your data, I would recommend using the hbase shell to manually move some regions to other region servers.

hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'

Move a region. Optionally specify target regionserver else we choose one at random. NOTE: You pass the encoded region name, not the region name, so this command is a little different from the others. The encoded region name is the hash suffix on region names: e.g. if the region name were TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396. then the encoded region name portion is 527db22f95c8a9e0116f0cc13c680396. A server name is its host, port plus startcode. For example: host187.example.com,60020,1289493121758
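Extracting the encoded region name from a full region name is simple string parsing; here is a small sketch using the example name from the help text above:

```python
def encoded_region_name(full_region_name: str) -> str:
    """Return the hash suffix of a full region name, i.e. the part
    between the final two '.' separators: table,start-key,timestamp.HASH."""
    return full_region_name.rstrip('.').rsplit('.', 1)[-1]

name = "TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396."
print(encoded_region_name(name))  # 527db22f95c8a9e0116f0cc13c680396
```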

If you create your table in HBase via Apache Phoenix, you can specify SALT_BUCKETS in the CREATE statement. The table will be split into as many regions as there are buckets. Phoenix computes a hash of the rowkey (most likely a numeric hash % SALT_BUCKETS) and assigns the cell to the appropriate region.

CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city)) SALT_BUCKETS=3;

This will pre-split the table into 3 regions.
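The salting idea can be sketched roughly as follows. This is only an illustration of the hash % SALT_BUCKETS concept described above, not Phoenix's actual hash function or byte layout:

```python
import hashlib

SALT_BUCKETS = 3

def salted_key(rowkey: bytes, buckets: int = SALT_BUCKETS) -> bytes:
    """Prepend a salt byte derived from a hash of the rowkey, so that
    rows spread across `buckets` pre-split regions."""
    salt = int.from_bytes(hashlib.md5(rowkey).digest()[:4], "big") % buckets
    return bytes([salt]) + rowkey

print(salted_key(b"NY,New York"))
```

Because the salt byte is deterministic, reads can recompute it from the logical key, while writes with different keys scatter across the salted regions instead of piling onto one.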

Alternatively, the default HBase web UI lets you split regions manually.