Delete all data from an HBase table according to time range?

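I am trying to delete all data from an HBase table whose timestamp is older than a specified timestamp. This covers all column families and rows.

Is there a way to do this using the shell as well as the Java API?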

Yes, this can be done easily by setting a time range on the scan and then deleting the rows in the returned result set.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BulkDeleteDriver {
        // Restricting the scan to one column family and column lessens the scan I/O
        private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
        private static final byte[] COL = Bytes.toBytes("column");
        private static final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");

        // Placeholders for values the snippet needs; the numbers are example assumptions
        private static final String HBASE_SITE_PATH = "<path to hbase-site.xml>";
        private static final int DAY_OFFSET = -1;   // example: range starts 1 day ago
        private static final int HOUR_OFFSET = -6;  // example: range ends 6 hours ago
        private static final int SCAN_CACHE = 1000; // rows fetched per scanner round trip
        private static final int BATCH_SIZE = 1000; // give any suitable batch size here

        public static void main(final String[] args) throws IOException {
            // Load the HBase configuration, including hbase-site.xml
            Configuration conf = HBaseConfiguration.create();
            conf.addResource(new Path(HBASE_SITE_PATH));

            // Get calendar instances and derive the start and end timestamps
            Calendar calStart = Calendar.getInstance();
            calStart.add(Calendar.DAY_OF_MONTH, DAY_OFFSET);
            Calendar calEnd = Calendar.getInstance();
            calEnd.add(Calendar.HOUR, HOUR_OFFSET);
            long startTS = calStart.getTimeInMillis();
            long endTS = calEnd.getTimeInMillis();

            // Most important part of the code: set the time range properly!
            // Here the purpose is to delete everything up to present time - 6 hours.
            Scan scan = new Scan();
            scan.setTimeRange(startTS, endTS); // startTS inclusive, endTS exclusive
            scan.setCaching(SCAN_CACHE);
            scan.addColumn(COL_FAM, COL);

            List<Delete> listOfBatchDeletes = new ArrayList<>();
            long recordCount = 0;

            // try-with-resources closes the scanner, the table and the connection
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf(TEST_TABLE));
                 ResultScanner resultScanner = table.getScanner(scan)) {
                System.out.println("Got the table: " + table.getName());

                // Scan the table and collect the row keys
                for (Result scanResult : resultScanner) {
                    // Note: a Delete built from the row key alone removes the whole row
                    listOfBatchDeletes.add(new Delete(scanResult.getRow()));
                    recordCount++;

                    // Fire the deletes in batches
                    if (listOfBatchDeletes.size() >= BATCH_SIZE) {
                        System.out.println("Firing batch delete now...");
                        table.delete(listOfBatchDeletes);
                        // Don't forget to clear the list
                        listOfBatchDeletes.clear();
                    }
                }
                System.out.println("Firing final batch of deletes...");
                table.delete(listOfBatchDeletes);
                System.out.println("Total records deleted: " + recordCount);
            }
        }
    }
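
The timestamp arithmetic is the part that is easiest to get wrong, so here is a minimal sketch of it in isolation, using `java.time` instead of `Calendar`. The class name `ScanWindow` and the method `between` are hypothetical helpers, not part of the HBase API; the two resulting values are what you would pass to `scan.setTimeRange`, which treats the lower bound as inclusive and the upper bound as exclusive.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not an HBase class): holds the two bounds of a scan time range.
final class ScanWindow {
    final long startTs; // inclusive lower bound, epoch milliseconds
    final long endTs;   // exclusive upper bound, epoch milliseconds

    ScanWindow(long startTs, long endTs) {
        if (startTs > endTs) {
            throw new IllegalArgumentException("startTs must not exceed endTs");
        }
        this.startTs = startTs;
        this.endTs = endTs;
    }

    // Range covering timestamps from (now - maxAge) up to, but excluding, (now - minAge).
    static ScanWindow between(Instant now, Duration maxAge, Duration minAge) {
        return new ScanWindow(now.minus(maxAge).toEpochMilli(),
                              now.minus(minAge).toEpochMilli());
    }

    public static void main(String[] args) {
        // Example assumption: cells written between 1 day ago and 6 hours ago
        ScanWindow w = between(Instant.now(), Duration.ofDays(1), Duration.ofHours(6));
        System.out.println("scan.setTimeRange(" + w.startTs + "L, " + w.endTs + "L)");
    }
}
```

Passing `now` as a parameter rather than reading the clock inside the method keeps the calculation easy to check with fixed values.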

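HBase has no concept of range delete markers. If you need to delete multiple cells, you have to place a delete marker for every cell, which means you have to scan each row, either on the client side or on the server side. That leaves you with two options:

  1. BulkDeleteProtocol: This uses a coprocessor endpoint, so the complete operation runs on the server side. The link has an example of how to use it. A web search will quickly show you how to enable a coprocessor endpoint in HBase.
  2. Scan and delete: This is the cleanest and simplest option. Since you said you need to delete all column families older than a particular timestamp, the scan and delete operation can be optimized greatly by using server-side filtering to read only the first key of each row.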

    Scan scan = new Scan();
    scan.setTimeRange(0, STOP_TS);  // STOP_TS: the timestamp in question (exclusive upper bound)
    // Crucial optimization: make sure you process multiple rows together
    scan.setCaching(1000);
    // Crucial optimization: retrieve only row keys
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(), new KeyOnlyFilter());
    scan.setFilter(filters);
    List<Delete> deletes = new ArrayList<>(1000);
    // try-with-resources closes the scanner when the loop finishes
    try (ResultScanner scanner = table.getScanner(scan)) {
      Result[] rr;
      do {
        // We set caching to 1000 above;
        // make full use of it and get the next 1000 rows in one go
        rr = scanner.next(1000);
        if (rr.length > 0) {
          for (Result r : rr) {
            // Delete(row, ts) removes cells with timestamp <= ts in all column families;
            // pass STOP_TS - 1 if cells stamped exactly STOP_TS must be kept
            deletes.add(new Delete(r.getRow(), STOP_TS));
          }
          table.delete(deletes);
          deletes.clear();
        }
      } while (rr.length > 0);
    }
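
Both Java answers rely on the same pattern: collect deletes, send them when the batch is full, clear the list, and send whatever is left at the end. A minimal, HBase-free sketch of that pattern follows. `BatchBuffer` is a hypothetical helper class, not something HBase provides, and the sink here only prints; in a real client the sink would call `table.delete(batch)` and handle its `IOException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers items and hands them to a sink in fixed-size batches.
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending;
    private long total;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    // Adds one item and flushes automatically once the batch is full.
    void add(T item) {
        pending.add(item);
        total++;
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends the pending items, if any. Call once more after the last add().
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand over a copy so the sink may keep or modify the list safely
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        BatchBuffer<String> buffer =
            new BatchBuffer<>(3, batch -> System.out.println("flush " + batch));
        for (int i = 1; i <= 7; i++) {
            buffer.add("row" + i);
        }
        buffer.flush(); // final, partially filled batch
        System.out.println("total " + buffer.total());
    }
}
```

Skipping the flush when nothing is pending avoids sending an empty final batch when the row count is an exact multiple of the batch size.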
    

If you want to remove the data from the shell and would rather not write a Java client, you can do the following:

    #!/bin/bash
    table_name="YOUR_TABLE_NAME"
    start_time=1607731200000
    end_time=1607817600000

    row_key_file="/tmp/$start_time-$end_time.rowkey"
    now=$(date +'%Y-%m-%d:%H-%M-%S')

    echo "$now: scanning records from date range $start_time to $end_time"
    # Keep the first field (the row key) of each result line. The length check skips
    # header and footer lines and assumes row keys longer than 20 characters; adjust it for your keys.
    echo "scan '$table_name', {TIMERANGE => [$start_time, $end_time]}" | hbase shell -n | awk '{if (length($1) > 20) {print $1}}' | sort -u > "$row_key_file"

    rows_scanned=$(wc -l < "$row_key_file")
    echo "Rows scanned: $rows_scanned"
    echo "deleting rows"

    echo "File.foreach('$row_key_file') { |line| key = line.strip; deleteall '$table_name', key }" | hbase shell -n
    now=$(date +'%Y-%m-%d:%H-%M-%S')
    echo "$now: Data truncation completed"

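start_time and end_time are the start and end of the time range as epoch timestamps in milliseconds. Every row with at least one cell in that range is deleted, and because deleteall removes the entire row, its cells outside the range go too.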
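
If you need to derive such epoch values from calendar dates, a small sketch with `java.time` is shown below. The class `EpochMillis` and its method are hypothetical names chosen for illustration, and UTC is an assumption; use your own zone if your timestamps were written relative to local time. The two values used in the script correspond to midnight UTC on 2020-12-12 and 2020-12-13.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper for producing the start_time / end_time values used by the script.
final class EpochMillis {
    // Midnight UTC at the start of the given date, as epoch milliseconds.
    static long startOfDayUtc(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2020, 12, 12);
        System.out.println("start_time=" + startOfDayUtc(day));            // 1607731200000
        System.out.println("end_time=" + startOfDayUtc(day.plusDays(1)));  // 1607817600000
    }
}
```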