Textract Unsupported Document Exception

Question

I am trying to run a Textract detect_document_text request using boto3.

I am using the following code:

import boto3

client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'Bytes': image_b64['document_b64']
    }
)

Here image_b64['document_b64'] is the base64 image string that I generated with, for example, the https://base64.guru/converter/encode/image website.

But I am getting the following error:

UnsupportedDocumentException

What am I doing wrong?

Answer 1

Per the documentation:

If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes passed using the Bytes field.

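Base64 encoding is only needed when you call the REST API directly. When you use the Python or NodeJS SDK, pass native bytes (the raw binary content of the image) in the Bytes field.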
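
If all you have is a base64 string, as in the question, it has to be turned back into raw bytes before it goes into the Bytes field. The sketch below is one way to do that and to sanity-check the result. The helper names to_raw_bytes and sniff_format are made up for this illustration, and the data-URI prefix handling is an assumption about what an online converter might emit, not something stated in the question. Only the two image formats mentioned in this thread (PNG and JPEG) are checked.

```python
import base64

# Leading "magic" bytes of the two image formats discussed in this thread.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def to_raw_bytes(b64_text):
    """Decode a base64 string, tolerating an optional 'data:...;base64,' prefix."""
    if b64_text.startswith("data:") and "," in b64_text:
        # Assumption: some converters prepend a data-URI header; drop it.
        b64_text = b64_text.split(",", 1)[1]
    return base64.b64decode(b64_text)


def sniff_format(raw):
    """Return 'png' or 'jpeg' if the bytes start with a known signature, else None."""
    for signature, name in _SIGNATURES.items():
        if raw.startswith(signature):
            return name
    return None
```

If sniff_format returns None for the value you are about to send, the payload is probably still base64 text (or some other format) rather than raw PNG or JPEG bytes, which would be consistent with the error in the question.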

Answer 2

For future reference, I solved the problem as follows:

import base64
import boto3

client = boto3.client('textract')
# Decode the base64 string back into raw image bytes; b64decode already returns bytes.
image_bytes = base64.b64decode(image_b64['document_b64'])
response = client.detect_document_text(
    Document={
        'Bytes': image_bytes
    }
)

Answer 3

For Boto3, if you are working with images (.jpg or .png) in a Jupyter notebook, you can use:

import boto3

# images_path is the path to a local .jpg or .png file.
with open(images_path, "rb") as img_file:
    img_bytes = img_file.read()  # raw bytes, no base64 encoding

textract = boto3.client('textract')
response = textract.detect_document_text(Document={'Bytes': img_bytes})
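
Once the call succeeds, the response is a dictionary whose Blocks list contains entries with a BlockType (such as PAGE, LINE, or WORD) and, for text blocks, a Text field. A minimal sketch for pulling out the detected lines is shown here. The helper name extract_lines and the sample response are made up for illustration; a real response carries many more fields.

```python
def extract_lines(response):
    """Return the text of every LINE block, in the order they appear in the response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a detect_document_text response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}

lines = extract_lines(sample_response)
```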

python

text-extraction

boto3

amazon-textract