Using Textract for OCR locally

Question

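I want to extract text from images using Python. (The Tesseract library doesn't work for me because it requires installation.)

I found the boto3 library and Textract, but I'm having trouble using it. I'm still new to this. Can you tell me what I need to do to run my script correctly?

Here is my code: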

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img})  # gives me error
print(response)

When I run this code, I get:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I also tried this:

# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')

But I get this error:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I'm a newbie at this, so any help would be appreciated. How can I read text from my image or PDF file?

I also added this code, but I still get the error "Unable to locate credentials".

session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)

Answer 1

The problem is in how the credentials are passed to boto3. You have to pass the credentials when creating the boto3 client.

import boto3

# boto3 client
client = boto3.client(
    'textract', 
    region_name='us-west-2', 
    aws_access_key_id='xxxxxxx', 
    aws_secret_access_key='xxxxxxx'
)

# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
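
The loop above can also be factored into a small helper so that the detected text can be reused rather than only printed. This is a minimal sketch: `extract_lines` is a hypothetical name, and `sample_response` is a hand-written stand-in that carries only the `Blocks`, `BlockType`, and `Text` fields the code above already relies on. A real response contains more fields.

```python
def extract_lines(response):
    """Return the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a real response (assumption: illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}

for line in extract_lines(sample_response):
    print(line)
```

With a real client, the same helper would be called as `extract_lines(response)` on the value returned by `client.detect_document_text(...)`.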

Note that hard-coding AWS keys in your code is not recommended. Please refer to the following documentation:

https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html
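
One way to avoid hard-coded keys, sketched here under assumptions, is to export them as the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, which boto3 reads on its own, and to create the client from a session. Keep in mind that a `boto3.Session` only affects clients created from it with `session.client(...)`; a client made with `boto3.client(...)` uses the default session instead. The names `missing_credential_vars` and `make_textract_client` are hypothetical, and the check covers only the environment-variable route, not the shared credentials file.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def make_textract_client(region_name="us-west-2"):
    """Create a Textract client from a session, using boto3's default credential lookup."""
    import boto3  # imported lazily so the check above works without boto3 installed

    session = boto3.Session(region_name=region_name)
    # The client must come from the session for the session's settings to apply.
    return session.client("textract")


# Example with an explicit mapping instead of the real environment:
print(missing_credential_vars({"AWS_ACCESS_KEY_ID": "xxxxxxx"}))
```

If `missing_credential_vars()` returns an empty list, `make_textract_client()` should be able to sign requests without any keys appearing in the script.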


python

amazon-web-services

amazon-textract