HBase HFile Corruption on AWS S3

We are running HBase on an EMR cluster (emr-5.7.0) backed by S3. We use the ImportTsv and CompleteBulkLoad utilities to import data into HBase. Intermittently, the process fails with errors indicating HFile corruption for some of the imported files. The failures are sporadic and we have not been able to find a pattern that would explain them.


After a lot of research and going through many suggestions in blog posts, I tried the fixes below, but to no avail; we are still facing the discrepancy.

Tech Stack:

  • AWS EMR Cluster (emr-5.7.0 | r3.8xlarge | 15 nodes)

  • AWS S3

  • HBase 1.3.1


Data Volume:

  • ~960,000 lines (to be upserted) | ~7 GB TSV file

Commands used in sequence:

 1) hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="|"  -Dimporttsv.columns="<Column Names (472 Columns)>" -Dimporttsv.bulk.output="<HFiles Path on HDFS>" <Table Name> <TSV file path on HDFS> 
 2) hadoop fs -chmod 777 <HFiles Path on HDFS>
 3) hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HFiles Path on HDFS> <Table Name>
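
To narrow down whether a file flagged during step 3 is genuinely corrupt or merely unreadable at that moment, HBase's bundled HFilePrettyPrinter can be pointed at the suspect file. A minimal sketch (the path is a placeholder for whichever HFile the error names):

    # Print the file's metadata, including the FixedFileTrailer that the corruption error complains about
    hbase hfile -m -f <path to suspect HFile>
    # Print the key/value pairs, which forces a full read of the file body
    hbase hfile -p -f <path to suspect HFile>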

Fixes Tried:

  1. Increasing S3 Max Connections:

    • We increased fs.s3.maxConnections, but it did not seem to resolve the issue. Values tried: 10000, 20000, 50000, 100000 (see the configuration sketch after this list).
  2. HBase Repair:

    • Another approach was to run the HBase repair tool, but it did not seem to help either.
      Command: hbase hbck -repair
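
For reference, the connection-pool change from fix 1 was applied as an EMRFS property. A minimal sketch, assuming it is set through emrfs-site.xml on the cluster nodes (the value shown is just one of the values we tried):

    <!-- emrfs-site.xml: size of the EMRFS S3 connection pool -->
    <property>
      <name>fs.s3.maxConnections</name>
      <value>50000</value>
    </property>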

The error trace is as follows:

[LoadIncrementalHFiles-17] mapreduce.LoadIncrementalHFiles: Received a CorruptHFileException from region server: row '00218333246' on table 'WB_MASTER' at region=WB_MASTER,00218333246,1506304894610.f108f470c00356217d63396aa11cf0bc., hostname=ip-10-244-8-74.ec2.internal,16020,1507907710216, seqNum=198
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file s3://wbpoc-landingzone/emrfs_test/wb_hbase_compressed/data/default/WB_MASTER/f108f470c00356217d63396aa11cf0bc/cf/2a9ecdc5c3aa4ad8aca535f56c35a32d_SeqId_200_
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:497)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:525)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1170)
    at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:259)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:427)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:659)
    at org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:799)
    at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5574)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2034)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService.callBlockingMethod(ClientProtos.java:34952)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.FileNotFoundException: File not present on S3
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$NativeS3FsInputStream.read(S3NativeFileSystem.java:203)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:391)
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:482)


Any suggestions on figuring out the root cause of this discrepancy would be really helpful.

Appreciate your help in advance. Thanks!

Answer:

After a lot of research and trial and error, and with help from AWS support, I finally found a solution to this issue. The problem appears to be caused by S3's eventual consistency. The AWS team suggested setting the property below, and it has worked like a charm; so far we have not run into any HFile corruption. Hope this helps anyone facing the same issue!

Property (hbase-site.xml): hbase.bulkload.retries.retryOnIOException : true
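
In hbase-site.xml form, that is (a minimal sketch):

    <property>
      <name>hbase.bulkload.retries.retryOnIOException</name>
      <value>true</value>
    </property>

As I understand it, this makes the bulk-load client retry an HFile that fails with an IOException instead of giving up immediately, which gives S3 time to become consistent so that the "File not present on S3" reads succeed on a later attempt.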