AWS Textract - UnsupportedDocumentException - PDF
I am using boto3 (the AWS SDK for Python) to analyze a document (PDF) and get its contents as key:value pairs.
import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=["FORMS"])

process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')
I followed the AWS documentation for Analyze Document, but when I run my function I get this error:
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
Am I missing something?
AnalyzeDocument is a synchronous API that supports only PNG or JPG images.
Since you want to use a PDF file, you need to use one of the Amazon Textract asynchronous APIs, such as StartDocumentAnalysis or StartDocumentTextDetection.
As the docs say:
StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.
Boto3 example:
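import boto3

client = boto3.client('textract')

# Start the asynchronous analysis job; the document must already be in S3
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# Get results from asynchronous operation
# Note: right after the job starts, JobStatus may still be 'IN_PROGRESS';
# call get_document_analysis again until the status changes.
result = client.get_document_analysis(JobId=response['JobId'])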
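Because the job runs asynchronously, get_document_analysis can report a JobStatus of IN_PROGRESS for a while, and long results are split across pages linked by NextToken. Here is a minimal sketch of polling and then paging, assuming a client object that behaves like boto3.client('textract'). The helper name fetch_all_blocks and the delay and max_attempts parameters are hypothetical, not part of boto3:

```python
import time


def fetch_all_blocks(client, job_id, delay=5.0, max_attempts=60):
    """Poll a Textract analysis job, then collect the Blocks from every page.

    `client` is expected to behave like boto3.client('textract').
    `delay` and `max_attempts` are illustrative defaults, not AWS recommendations.
    """
    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_attempts):
        page = client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(delay)
    else:
        raise TimeoutError(f"Job {job_id} still in progress after {max_attempts} polls")

    if page["JobStatus"] == "FAILED":
        raise RuntimeError(page.get("StatusMessage", "Textract job failed"))

    # Follow NextToken until the last page has been read.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks
```

You would call it as fetch_all_blocks(client, response['JobId']). For production use, Textract can also publish a completion notification to an SNS topic, which avoids polling altogether.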
In addition, the AWS documentation provides a TextractWrapper class with the methods start_analysis_job and get_analysis_job, which do the same thing as the example above.
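Since the goal in the question is key:value pairs, here is a rough sketch of turning the Blocks list from a FORMS analysis into a dict. It relies on the documented block layout: KEY_VALUE_SET blocks whose EntityTypes contain KEY link to their value through a VALUE relationship, and both sides link to their WORD blocks through CHILD relationships. The function name extract_key_values is hypothetical, and the sketch assumes you pass the complete list of blocks from all result pages:

```python
def extract_key_values(blocks):
    """Map each form key's text to its value's text from Textract Blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, relation_type):
        # Ids of blocks linked to `block` through the given relationship type.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == relation_type:
                ids.extend(rel["Ids"])
        return ids

    def text_of(block):
        # Join the WORD children; render checkboxes by their selection status.
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    pairs = {}
    for block in blocks:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if is_key:
            value_text = " ".join(
                text_of(by_id[value_id]) for value_id in related_ids(block, "VALUE")
            )
            pairs[text_of(block)] = value_text
    return pairs
```

Note that a plain dict keeps only the last value when the same key text appears more than once in the document; collect values into lists instead if your forms repeat field labels.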