AWS Textract - UnsupportedDocumentException - PDF

I am using boto3 (the AWS SDK for Python) to analyze a document (a PDF) and get its contents as key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for Analyze Document, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

AnalyzeDocument is a synchronous API that supports only PNG or JPG images.

Since you want to use a PDF file, you will need to use one of the Amazon Textract asynchronous APIs, such as StartDocumentAnalysis or StartDocumentTextDetection.

As the docs say:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.
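
The format rule described in this answer can be sketched as a small routing helper. This is only an illustration: the function name choose_textract_call and the two extension sets are assumptions made for the example, and the sets simply encode the formats named in this answer and in the quoted docs, not an authoritative list.

```python
import os

# Assumption: formats taken from the answer above, not queried from AWS.
SYNC_EXTENSIONS = {'.png', '.jpg', '.jpeg'}         # accepted by analyze_document
ASYNC_ONLY_EXTENSIONS = {'.pdf', '.tif', '.tiff'}   # need start_document_analysis


def choose_textract_call(key):
    """Return the name of the boto3 Textract method to use for an S3 key.

    Hypothetical helper: it only inspects the file extension.
    """
    ext = os.path.splitext(key)[1].lower()
    if ext in SYNC_EXTENSIONS:
        return 'analyze_document'
    if ext in ASYNC_ONLY_EXTENSIONS:
        return 'start_document_analysis'
    raise ValueError(f'Unsupported document extension: {ext!r}')
```

For the file in the question, choose_textract_call('709 Privado M SURESTE.pdf') picks the asynchronous call, which matches the advice above.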

Boto3 example

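import time

import boto3

client = boto3.client('textract')

# Start the asynchronous analysis job for a document stored in S3
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# The job runs in the background, so poll until it is no longer IN_PROGRESS
result = client.get_document_analysis(JobId=response['JobId'])
while result['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)
    result = client.get_document_analysis(JobId=response['JobId'])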
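
Since the goal in the question is key:value pairs, here is a minimal sketch of how a finished job's output could be turned into a dictionary. It assumes the job has already succeeded and that the blocks follow the usual FORMS layout (KEY_VALUE_SET blocks linked to WORD blocks through Relationships). The helper names collect_blocks, _text_of and extract_key_values are made up for this example, and selection elements such as checkboxes are ignored.

```python
def collect_blocks(client, job_id):
    """Gather the blocks from every result page of a finished job."""
    blocks, token = [], None
    while True:
        kwargs = {'JobId': job_id}
        if token:
            kwargs['NextToken'] = token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get('Blocks', []))
        token = page.get('NextToken')
        if not token:
            return blocks


def _text_of(block, by_id):
    """Join the text of the WORD children of a block."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = by_id[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)


def extract_key_values(blocks):
    """Map each form key's text to the text of its linked value block(s)."""
    by_id = {b['Id']: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block['BlockType'] == 'KEY_VALUE_SET'
                  and 'KEY' in block.get('EntityTypes', []))
        if not is_key:
            continue
        values = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                values.extend(_text_of(by_id[i], by_id) for i in rel['Ids'])
        # Repeated key texts overwrite each other in this simple sketch
        pairs[_text_of(block, by_id)] = ' '.join(values)
    return pairs
```

With a boto3 Textract client and the JobId returned by start_document_analysis, extract_key_values(collect_blocks(client, job_id)) would give the form fields as a plain dict.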

In addition, the AWS documentation provides a TextractWrapper class with start_analysis_job and get_analysis_job methods that do the same thing as the previous example.