HBase HFile Corruption on AWS S3
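I am running an HBase cluster on EMR (emr-5.7.0) with HBase on S3 enabled.
We are using the ImportTsv and CompleteBulkLoad utilities to import data into HBase.
During this process we have observed intermittent failures indicating HFile corruption in some of the imported files. It happens sporadically, and there is no pattern from which we can deduce the cause of the error.
After a lot of research and trying many suggestions from blogs, I tried the fixes below, to no avail; the failures persist.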
Tech Stack:
- AWS EMR cluster (emr-5.7.0 | r3.8xlarge | 15 nodes)
- AWS S3
- HBase 1.3.1
Data Volume:
- ~960,000 lines (to be upserted) | ~7 GB TSV file
Commands used in sequence:
1) hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="|" -Dimporttsv.columns="<Column Names (472 Columns)>" -Dimporttsv.bulk.output="<HFiles Path on HDFS>" <Table Name> <TSV file path on HDFS>
2) hadoop fs -chmod 777 <HFiles Path on HDFS>
3) hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HFiles Path on HDFS> <Table Name>
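The three steps above can be scripted so that each one runs only if the previous one succeeded. The Python sketch below is a hypothetical wrapper, not part of the original setup: the helper names (`import_tsv_cmd`, `chmod_cmd`, `load_hfiles_cmd`, `run_pipeline`) are made up for illustration, and the optional `props` argument assumes that LoadIncrementalHFiles accepts generic `-Dkey=value` options the same way ImportTsv does in step 1.

```python
import subprocess
from typing import Dict, List, Optional, Sequence

IMPORT_TSV = "org.apache.hadoop.hbase.mapreduce.ImportTsv"
LOAD_HFILES = "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles"


def import_tsv_cmd(table: str, tsv_path: str, hfile_dir: str,
                   columns: Sequence[str], separator: str = "|") -> List[str]:
    """Step 1: ImportTsv writes HFiles to hfile_dir (bulk output mode)."""
    # An argv list bypasses the shell, so the "|" separator needs no quoting.
    return ["hbase", IMPORT_TSV,
            f"-Dimporttsv.separator={separator}",
            "-Dimporttsv.columns=" + ",".join(columns),
            f"-Dimporttsv.bulk.output={hfile_dir}",
            table, tsv_path]


def chmod_cmd(hfile_dir: str) -> List[str]:
    """Step 2: open up permissions on the HFile output, as in the original."""
    return ["hadoop", "fs", "-chmod", "777", hfile_dir]


def load_hfiles_cmd(hfile_dir: str, table: str,
                    props: Optional[Dict[str, str]] = None) -> List[str]:
    """Step 3: LoadIncrementalHFiles (completebulkload) hands the HFiles to HBase."""
    # Generic -D options are placed before the positional arguments.
    opts = [f"-D{key}={value}" for key, value in (props or {}).items()]
    return ["hbase", LOAD_HFILES, *opts, hfile_dir, table]


def run_pipeline(cmds: Sequence[List[str]]) -> None:
    """Run the steps in order, stopping at the first failure."""
    for cmd in cmds:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed ImportTsv never proceeds to the bulk load.
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline([import_tsv_cmd(...), chmod_cmd(hfiles), load_hfiles_cmd(hfiles, "WB_MASTER")])` reproduces the sequence above. Passing argument lists straight to `subprocess.run` avoids a shell, which is why the separator is not quoted here.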
Fixes Tried:
Increasing S3 Max Connections:
- We increased the following property, but it did not resolve the issue: fs.s3.maxConnections (values tried: 10000, 20000, 50000, 100000).
HBase Repair:
- Another approach was to run the HBase repair command, but it didn't help either.
Command: hbase hbck -repair
The error trace is as follows:
[LoadIncrementalHFiles-17] mapreduce.LoadIncrementalHFiles: Received a CorruptHFileException from region server: row '00218333246' on table 'WB_MASTER' at region=WB_MASTER,00218333246,1506304894610.f108f470c00356217d63396aa11cf0bc., hostname=ip-10-244-8-74.ec2.internal,16020,1507907710216, seqNum=198
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file s3://wbpoc-landingzone/emrfs_test/wb_hbase_compressed/data/default/WB_MASTER/f108f470c00356217d63396aa11cf0bc/cf/2a9ecdc5c3aa4ad8aca535f56c35a32d_SeqId_200_
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:497)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:525)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1170)
    at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:259)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:427)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
    at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:659)
    at org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:799)
    at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5574)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2034)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService.callBlockingMethod(ClientProtos.java:34952)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.FileNotFoundException: File not present on S3
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$NativeS3FsInputStream.read(S3NativeFileSystem.java:203)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:391)
    at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:482)
Any suggestions for finding the root cause of this problem would be very helpful.
I appreciate your help. Thanks!
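After a lot of research and trial and error, and with help from AWS Support, I finally found a solution to this problem. The issue appears to be caused by S3's eventual consistency. The AWS team suggested setting the property below, and it has worked well: so far we have not run into the HFile corruption issue again. Hope this helps anyone facing the same problem!
Property (hbase-site.xml):
hbase.bulkload.retries.retryOnIOException : true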
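Why would retrying help with a file that is reported as missing? The Python sketch below is a toy model, not HBase's implementation: a hypothetical `EventuallyConsistentStore` makes a newly written object readable only after a few attempts, and a hypothetical `retry_on_io_error` helper retries on `OSError` (Python's counterpart of Java's `IOException`, and the parent class of `FileNotFoundError`) with exponential backoff. It assumes, as the answer suggests, that the failure is a transient visibility lag rather than genuinely corrupt data.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def retry_on_io_error(op: Callable[[], T], max_retries: int = 5,
                      base_delay: float = 0.1) -> T:
    """Call op(), retrying on OSError with exponential backoff.

    A toy model of retry-on-IOException, not HBase's implementation.
    """
    attempt = 0
    while True:
        try:
            return op()
        except OSError:  # FileNotFoundError is a subclass of OSError
            if attempt >= max_retries:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2 ** attempt)
            attempt += 1


class EventuallyConsistentStore:
    """Hypothetical object store: a freshly written key only becomes
    readable after `lag` failed reads, mimicking delayed visibility."""

    def __init__(self, lag: int) -> None:
        self._lag = lag
        self._data: Dict[str, List] = {}  # key -> [value, reads still hidden]
        self.reads = 0

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = [value, self._lag]

    def get(self, key: str) -> bytes:
        self.reads += 1
        entry = self._data.get(key)
        if entry is None:
            raise FileNotFoundError(f"File not present: {key}")
        if entry[1] > 0:
            entry[1] -= 1  # still invisible to this reader
            raise FileNotFoundError(f"File not present: {key}")
        return entry[0]
```

In the trace above, the `FileNotFoundException: File not present on S3` raised while reading the HFile trailer plays the role of the transient error. With `max_retries=0` the first such error is fatal, which would be consistent with the intermittent CorruptHFileException; with a few retries the same read eventually succeeds.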