How to process files in batch with PDF OCR?

Question

I want to use Google OCR to asynchronously batch-process 20,000 PDFs, but I could not find any documentation covering this. I have already tried the client.asyncBatchAnnotateFilesAsync function:

List<AsyncAnnotateFileRequest> requests = new ArrayList<>();
for (MultipartFile file : files) {
    GcsSource gcsSource = GcsSource.newBuilder().setUri(gcsSourcePath + file.getOriginalFilename()).build();
    InputConfig inputConfig = InputConfig.newBuilder().setMimeType("application/pdf").setGcsSource(gcsSource)
            .build();
    GcsDestination gcsDestination = GcsDestination.newBuilder()
            .setUri(gcsDestinationPath + file.getOriginalFilename()).build();
    OutputConfig outputConfig = OutputConfig.newBuilder().setBatchSize(2).setGcsDestination(gcsDestination)
            .build();
    AsyncAnnotateFileRequest request = AsyncAnnotateFileRequest.newBuilder().addFeatures(feature)
            .setInputConfig(inputConfig).setOutputConfig(outputConfig).build();
    requests.add(request);

}
AsyncBatchAnnotateFilesRequest request = AsyncBatchAnnotateFilesRequest.newBuilder().addAllRequests(requests)
        .build();
AsyncBatchAnnotateFilesResponse response = client.asyncBatchAnnotateFilesAsync(request).get();
System.out.println("Waiting for the operation to finish.");

But I get this error message:

io.grpc.StatusRuntimeException: INVALID_ARGUMENT: At this time, only single requests are supported for asynchronous processing.

If Google does not support batch processing, why do they provide asyncBatchAnnotateFilesAsync? Am I perhaps using an old version? Does asyncBatchAnnotateFilesAsync work in another beta version?

Answer 1

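The Vision service does not support multiple requests in a single call.

This can be confusing because, according to the RPC API documentation, you can indeed provide multiple requests in a single service call (one file per request). However, according to this issue tracker, the Vision service has a known limitation: it currently accepts only one request at a time.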

Answer 2

Since you can only send one file per request, could you just send 20k requests? They are asynchronous requests, so sending them should be quite fast.
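A minimal sketch of that one-request-per-file pattern, using only the JDK so it stays independent of any particular client version. The class name OnePerFileSubmitter, the submitOne callback, and the maxInFlight cap are all hypothetical names introduced here for illustration. In a real setup, submitOne would presumably build a single AsyncAnnotateFileRequest for the given URI, pass it alone to client.asyncBatchAnnotateFilesAsync, and adapt the returned future to a CompletableFuture; that wiring is an assumption and is not shown. The cap on pending requests is also an assumption, added in case quotas or rate limits apply to 20k submissions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical helper: submits one request per file instead of one batch.
final class OnePerFileSubmitter {

    private OnePerFileSubmitter() {}

    /**
     * Calls submitOne once per URI, never keeping more than maxInFlight
     * requests pending, then waits for all of them and returns the results
     * in the same order as the input list.
     */
    static <T> List<T> submitAll(List<String> uris,
                                 Function<String, CompletableFuture<T>> submitOne,
                                 int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<T>> pending = new ArrayList<>();

        for (String uri : uris) {
            // Block here while maxInFlight requests are still pending.
            permits.acquireUninterruptibly();
            CompletableFuture<T> future;
            try {
                future = submitOne.apply(uri);
            } catch (RuntimeException e) {
                // The request was never started, so give the permit back.
                permits.release();
                throw e;
            }
            // Free the slot when the request finishes, whether it succeeded or failed.
            future.whenComplete((result, error) -> permits.release());
            pending.add(future);
        }

        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : pending) {
            // join() rethrows a failure wrapped in CompletionException.
            results.add(future.join());
        }
        return results;
    }
}
```

Called as OnePerFileSubmitter.submitAll(uris, submitOne, 50), this would send every file as its own request while holding at most 50 open at a time; the value 50 is arbitrary and would need tuning against whatever limits apply to your project.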

java

google-cloud-storage

google-cloud-platform

google-cloud-vision