Document AI process document fails with invalid argument when processing docs from GCS

I am getting an error very similar to the one below, but I am not in the EU:

When I use raw_document to process a local PDF file, it works fine. However, when I point to a PDF file at a GCS location, it fails.

Error message:

the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
  uri: "gs://xxxx/temp/test1.pdf"
}

Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Request contains an invalid argument."
        debug_error_string = "{"created":"@1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>

Code:

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    print(f'the processor name: {name}')

    # document = {"uri": gcs_path, "mime_type": "application/pdf"}
    document = {"uri": gcs_path}
    inline_document = documentai.Document()
    inline_document.uri = gcs_path
    # inline_document.mime_type = "application/pdf"

    # Configure the process request
    # request = {"name": name, "inline_document": document}
    request = documentai.ProcessRequest(
        inline_document=inline_document,
        name=name
    )

    print(f'the form process request: {request}')

    result = client.process_document(request=request)

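I don't think I have a permissions problem on the bucket, because the same setup works for a document classification processor on the same bucket.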

This is a known issue with Document AI and has already been reported in the issue tracker. Unfortunately, the only workarounds at the moment are:

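  1. Download your file, read it as bytes, and pass it to process_document(). See the Document AI local processing sample for example code.
  2. Use batch_process_documents(), which by default only accepts files from GCS. This is the option to pick if you don't want the extra step of downloading the file.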
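
Both workarounds can be sketched roughly as follows. This is only a sketch, not the official sample: it assumes the `google-cloud-documentai` (v1) and `google-cloud-storage` packages are installed, and the helper names `split_gcs_uri`, `process_gcs_pdf` and `batch_process_gcs_pdf`, the `gcs_output_uri` parameter and the 300-second timeout are made up for illustration.

```python
def split_gcs_uri(gcs_uri):
    """Split 'gs://bucket/path/file.pdf' into (bucket_name, object_name)."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gcs_uri!r}")
    bucket_name, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket_name or not object_name:
        raise ValueError(f"bucket or object name missing: {gcs_uri!r}")
    return bucket_name, object_name


def process_gcs_pdf(client, name, gcs_uri):
    """Workaround 1: download the PDF and send its bytes as a raw_document."""
    # Imported lazily so split_gcs_uri stays usable without the Cloud SDKs.
    from google.cloud import documentai, storage

    bucket_name, object_name = split_gcs_uri(gcs_uri)
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    content = blob.download_as_bytes()

    raw_document = documentai.RawDocument(
        content=content, mime_type="application/pdf"
    )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request).document


def batch_process_gcs_pdf(client, name, gcs_uri, gcs_output_uri):
    """Workaround 2: let batch_process_documents() read the file from GCS."""
    from google.cloud import documentai

    gcs_document = documentai.GcsDocument(
        gcs_uri=gcs_uri, mime_type="application/pdf"
    )
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(documents=[gcs_document])
    )
    # gcs_output_uri is a hypothetical gs:// prefix where results are written.
    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=gcs_output_uri
        )
    )
    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=300)  # block until the batch job finishes
    return operation
```

With the `client` and `name` from the question, `process_gcs_pdf(client, name, gcs_path)` would return the processed document directly, whereas the batch variant writes its results as JSON files under the output prefix, so you would need to read them back from GCS afterwards.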