如何使用 Amazon Textract 以同步方式分析 PDF 文档？

Question

我想从我拥有的一堆 PDF 中提取表格。为此，我使用 AWS Textract Python 管道。

请指教在没有 SNS 和 SQS 的情况下我该怎么做？我希望它是同步的：为我的管道提供一个 PDF 文件，调用 AWS Textract 并获取结果。

这是我同时使用的，请指教我应该改变什么：

import boto3
import time

def startJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        }
    })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):

    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)

    pages.append(response)
    print("Resultset page recieved: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

    while(nextToken):

        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

        pages.append(response)
        print("Resultset page recieved: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(isJobComplete(jobId)):
    response = getJobResults(jobId)

#print(response)

# Print detected text
for resultPage in response:
    for item in resultPage["Blocks"]:
        if item["BlockType"] == "LINE":
            print ('3[94m' +  item["Text"] + '3[0m')

Answer 1

目前不能直接用Textract同步处理PDF文档。来自 Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.

解决方法是在您的代码中，然后对这些图像使用同步 API 操作来处理文档。

如何使用 Amazon Textract 以同步方式分析 PDF 文档？

How to analyse PDF documents with Amazon Textract in a Synchronous way?

python

amazon-web-services

python-3.x

amazon-textract