AWS textract - UnsupportedDocumentException

Question

I am implementing AWS Textract in Python using boto3.

Code:

import boto3

# Document
documentName = "/home/niranjan/IdeaProjects/amazon-forecast-samples/notebooks/basic/Tutorial/cert.pdf"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

print(type(imageBytes))

# Amazon Textract client
textract = boto3.client('textract', region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

Below are my AWS credentials and config files:

niranjan@niranjan:~$ cat ~/.aws/credentials
[default]
aws_access_key_id=my_access_key_id
aws_secret_access_key=my_secret_access_key

niranjan@niranjan:~$ cat ~/.aws/config 
[default]
region=eu-west-1

I am getting this exception:

---------------------------------------------------------------------------
UnsupportedDocumentException              Traceback (most recent call last)
<ipython-input-11-f52c10e3f3db> in <module>
     14 
     15 # Call Amazon Textract
---> 16 response = textract.detect_document_text(Document={'Bytes': imageBytes})
     17 
     18 #print(response)

~/venv/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/venv/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

I am fairly new to AWS Textract; any help would be much appreciated.

Answer 1

Textract's DetectDocumentText API does not support documents of type PDF, so sending a PDF raises UnsupportedDocumentException. Try sending an image file instead.

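If you still want to send a PDF file, you have to use Textract's asynchronous APIs, for example StartDocumentAnalysis to start the analysis and GetDocumentAnalysis to fetch the analyzed document. These operations read the document from an Amazon S3 bucket rather than from raw bytes.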
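The asynchronous flow described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `analyze_pdf`, the `poll_seconds` and `max_polls` parameters, and the choice of `FeatureTypes` are assumptions made for illustration. The client is passed in so the function can be exercised with a stand-in object.

```python
import time


def analyze_pdf(client, bucket, key, poll_seconds=5, max_polls=60):
    """Start an asynchronous analysis job and return every page of results.

    `client` is expected to behave like boto3.client('textract').
    """
    # StartDocumentAnalysis requires FeatureTypes; TABLES and FORMS are
    # chosen here only as an example.
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS, giving up after max_polls tries.
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("job {} did not finish in time".format(job_id))

    # This sketch treats anything other than SUCCEEDED as an error.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            "job {} ended with status {}".format(job_id, response["JobStatus"])
        )

    # Results are paginated; follow NextToken until it is absent.
    pages = [response]
    while "NextToken" in response:
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        pages.append(response)
    return pages
```

With real credentials it could be called as `analyze_pdf(boto3.client('textract'), 'my-bucket', 'cert.pdf')`, where the bucket name is a placeholder and the PDF has already been uploaded to S3.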

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format. DetectDocumentText returns the detected text in an array of Block objects.

https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
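Given the format restriction quoted above, one way to avoid the exception is to inspect the file's leading bytes before calling DetectDocumentText. The helper names below (`sniff_format`, `is_supported_for_sync`) are hypothetical and not part of boto3; this is only a sketch based on the well-known file signatures for PNG, JPEG, and PDF.

```python
def sniff_format(data):
    """Guess a document's format from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"


def is_supported_for_sync(data):
    """True if the bytes look like JPEG or PNG, the formats quoted above."""
    return sniff_format(data) in ("png", "jpeg")
```

Applied to the `imageBytes` from the question, `sniff_format(imageBytes)` would be expected to report `"pdf"` for a well-formed PDF file, which signals that the synchronous call should be skipped in favor of the asynchronous APIs.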

Answer 2

import boto3
import time

def startJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        }
    })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):

    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

    while(nextToken):

        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

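jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))

# isJobComplete returns the final status string, so compare it explicitly;
# any non-empty string (including "FAILED") would otherwise be truthy.
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)

    #print(response)

    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                # \033[94m switches the terminal color to blue, \033[0m resets it
                print('\033[94m' + item["Text"] + '\033[0m')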

Try this code, and refer to this link from AWS for an explanation.
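The comment in `isJobComplete` recommends SNS-based notification for production instead of polling. A minimal sketch of starting the job that way is shown below, assuming an SNS topic and an IAM role that Textract may use to publish to it already exist. The function name and the `topic_arn` and `role_arn` parameters are hypothetical; the client is passed in so the function can be tried with a stand-in object.

```python
def start_job_with_notification(client, bucket, name, topic_arn, role_arn):
    """Start text detection and ask Textract to publish completion to SNS.

    `client` is expected to behave like boto3.client('textract').
    `topic_arn` and `role_arn` are placeholders supplied by the caller.
    """
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}},
        # Textract publishes the job's completion status to this topic,
        # using the given role.
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

A subscriber to the topic (for example an SQS queue or a Lambda function) would then call `getJobResults` with the job ID once the completion message arrives, so no sleep loop is needed.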

amazon-web-services

amazon-textract