Unable to upload PDF files larger than 10 MB into HBase via Python happybase - HDP 3

We are using HDP 3 and are trying to insert PDF files into one of the columns of a specific column family in an HBase table. The development environment is Python 3.6, and the HBase connector is happybase 1.1.0.

We are unable to upload any PDF file larger than 10 MB into HBase.

In HBase we have set the following parameters:

We receive the following error:

IOError(message=b'org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.DoNotRetryIOException: Cell with size 80941994 exceeds limit of 10485760 bytes\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.checkCellSizeLimit(RSRpcServices.java:937)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1010)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:959)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:922)\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2683)\n\tat org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService.callBlockingMethod(ClientProtos.java:42014)\n\tat org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)\n\tat org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)\n\tat org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)\n\tat
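The original post does not include the upload code, but a put of roughly this shape triggers the error once the value exceeds 10 MB. This is a minimal sketch, not the asker's actual code: the table name, column family, qualifier, and host are all illustrative, and the pre-check function simply mirrors the server-side guard shown below.

```python
# Minimal happybase sketch of the failing upload. Table name, column
# family/qualifier and host are assumptions, not taken from the post.
DEFAULT_MAX_CELL_SIZE = 10485760  # hbase.server.keyvalue.maxsize default (10 MB)

def would_exceed_cell_limit(value, limit=DEFAULT_MAX_CELL_SIZE):
    # Client-side pre-check mirroring the server's checkCellSizeLimit guard;
    # a limit of 0 or less disables the check, as on the server.
    return limit > 0 and len(value) > limit

def upload_pdf(host, table_name, row_key, pdf_path):
    import happybase  # third-party Thrift client: pip install happybase
    with open(pdf_path, 'rb') as f:
        data = f.read()
    if would_exceed_cell_limit(data):
        # Fail fast instead of letting the server reject the RPC with
        # DoNotRetryIOException.
        raise ValueError('%d bytes exceeds the server cell limit' % len(data))
    conn = happybase.Connection(host)
    try:
        conn.table(table_name).put(row_key, {b'cf:content': data})
    finally:
        conn.close()
```

Any value larger than the server-side cell limit is rejected by the region server itself, so no client-side setting alone can make this put succeed.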

You have to look at the HBase source code to understand what is happening:

private void checkCellSizeLimit(final HRegion r, final Mutation m) throws IOException {
  if (r.maxCellSize > 0) {
    CellScanner cells = m.cellScanner();
    while (cells.advance()) {
      int size = PrivateCellUtil.estimatedSerializedSizeOf(cells.current());
      if (size > r.maxCellSize) {
        String msg = "Cell with size " + size + " exceeds limit of " + r.maxCellSize + " bytes";
        if (LOG.isDebugEnabled()) {
          LOG.debug(msg);
        }
        throw new DoNotRetryIOException(msg);
      }
    }
  }
}

According to the error message, you exceeded r.maxCellSize: your cell is 80941994 bytes (about 77 MB), far above the 10485760-byte (10 MB) limit.

Note that the function PrivateCellUtil.estimatedSerializedSizeOf used above is deprecated and will be removed in a future version.

This is its description:

Estimate based on keyvalue's serialization format in the RPC layer. Note that there is an extra SIZEOF_INT added to the size here that indicates the actual length of the cell for cases where cell's are serialized in a contiguous format (For eg in RPCs).

You have to check where this value is set. First check the "ordinary" value in HRegion.java:

this.maxCellSize = conf.getLong(HBASE_MAX_CELL_SIZE_KEY, DEFAULT_MAX_CELL_SIZE);

So there must be an HBASE_MAX_CELL_SIZE_KEY or DEFAULT_MAX_CELL_SIZE limit defined somewhere:

public static final String HBASE_MAX_CELL_SIZE_KEY = "hbase.server.keyvalue.maxsize";
public static final int DEFAULT_MAX_CELL_SIZE = 10485760;

There is your 10485760 limit, the one shown in your error message. If you need to, you can try raising this limit to a value that fits your data. I recommend testing it properly before relying on it, as the limit may exist for good reasons.
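If you do decide to raise it, the property can be overridden in hbase-site.xml (followed by a region server restart). This is only a sketch; the 104857600 (100 MB) value is an illustrative example, not a recommendation:

```xml
<!-- Example override in hbase-site.xml; 104857600 (100 MB) is an
     illustrative value, not a recommendation. -->
<property>
  <name>hbase.server.keyvalue.maxsize</name>
  <value>104857600</value>
</property>
<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <value>104857600</value>
</property>
```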

Edit: adding information on how to change the hbase.server.keyvalue.maxsize value. Check the config files:

There you can read:

hbase.client.keyvalue.maxsize

Description: Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.

Default: 10485760

hbase.server.keyvalue.maxsize

Description: Maximum allowed size of an individual cell, inclusive of value and all key components. A value of 0 or less disables the check. The default value is 10MB. This is a safety setting to protect the server from OOM situations.

Default: 10485760
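If you would rather not raise the server limit, a common alternative (not part of the original answer, just a workaround sketch) is to split the PDF across several cells so that each one stays below the cap. The qualifier naming scheme (cf:part0, cf:part1, ...) is a hypothetical convention:

```python
# Hypothetical workaround: split a large value across numbered qualifiers
# so every individual cell stays below the 10 MB default limit.
CHUNK_SIZE = 10 * 1024 * 1024 - 1024  # leave headroom for key overhead

def split_chunks(data, chunk_size=CHUNK_SIZE):
    # Slice the byte string into consecutive chunks of at most chunk_size.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_columns(data, family=b'cf', chunk_size=CHUNK_SIZE):
    # Build the {column: value} dict expected by happybase's table.put(),
    # mapping chunk i to the qualifier part<i>.
    return {family + b':part%d' % i: part
            for i, part in enumerate(split_chunks(data, chunk_size))}
```

A reader would then reassemble the file by fetching the row and concatenating the parts in qualifier order. This keeps each cell within the safety limit, at the cost of extra bookkeeping on read.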