Parsing multipage tables into CSV files with AWS Textract

Question

I'm new to AWS and am trying to use AWS Textract to parse the tables of a multi-page file into CSV files. I tried the AWS example on this page; however, when dealing with a multi-page file, the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, since those cases require asynchronous processing, as you can see in the documentation here. The correct function to call is client.start_document_analysis; once the job has run, the results are retrieved with client.get_document_analysis(JobId).

So I modified their example to use this logic instead of the client.analyze_document function. The modified code looks like this:

import time

import boto3

client = boto3.client('textract')

response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])

jobid = response['JobId']

# Poll until the job leaves the IN_PROGRESS state
jobstatus = "IN_PROGRESS"
while jobstatus == "IN_PROGRESS":
    response = client.get_document_analysis(JobId=jobid)
    jobstatus = response['JobStatus']
    if jobstatus == "IN_PROGRESS":
        print("IN_PROGRESS")
        time.sleep(5)

But when I run it, I get the following error:

Traceback (most recent call last):
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
    main(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
    table_csv = get_table_csv_results(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
    response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
    api_params, operation_model, context=request_context)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
    api_params, operation_model)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel

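This happens because start_document_analysis does not accept a Document parameter with raw bytes. The standard way to call it is with a file stored in S3, using this syntax: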

    response = client.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        },
        FeatureTypes=["TABLES"])

However, if I do that, I will break the command-line logic proposed in the AWS example:

python textract_python_table_parser.py file.pdf

The question is: how can I adapt the AWS example so that it can process multi-page files?
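For reference, one possible way to keep the command-line usage is to upload the local file to S3 first and then run the asynchronous job against that object. The following is only a minimal sketch under stated assumptions, not the AWS example itself: the helper names (upload_and_start, wait_for_blocks, main) and the bucket argument are hypothetical, and the S3 bucket is assumed to exist already and to be readable by Textract. The results of get_document_analysis can span several response pages, so the sketch follows NextToken until it is absent.

```python
import os
import time


def upload_and_start(s3, textract, file_name, bucket):
    """Upload a local file to S3 and start an asynchronous table analysis."""
    key = os.path.basename(file_name)  # assumption: the bare file name is used as the object key
    s3.upload_file(file_name, bucket, key)
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    return response["JobId"]


def wait_for_blocks(textract, job_id, delay=5):
    """Poll until the job finishes, then collect the blocks from every result page."""
    response = textract.get_document_analysis(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        time.sleep(delay)
        response = textract.get_document_analysis(JobId=job_id)
    if response["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])

    blocks = list(response.get("Blocks", []))
    # Large results are paginated: keep requesting pages while a NextToken is returned
    while response.get("NextToken"):
        response = textract.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


def main(file_name, bucket):
    """Hypothetical entry point: file_name comes from the command line, bucket from configuration."""
    import boto3  # imported here so the helpers above can be exercised without AWS access

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    job_id = upload_and_start(s3, textract, file_name, bucket)
    blocks = wait_for_blocks(textract, job_id)
    print("Collected", len(blocks), "blocks")
    return blocks
```

The returned list of blocks has the same shape as the Blocks list of a synchronous analyze_document response, so table-to-CSV logic written against that list should need little change.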

Answer 1

Consider using two different Lambda functions: one to call Textract and one to process the results.
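As a rough illustration of that split, here is a minimal sketch of two handlers. It is not the code from the resources linked in this answer, and it makes several assumptions: the first function is triggered by an S3 upload event, Textract is asked to publish its completion message to an SNS topic through NotificationChannel, and the second function is subscribed to that topic. The handler names and the environment variable names TEXTRACT_SNS_TOPIC_ARN and TEXTRACT_ROLE_ARN are hypothetical, and the optional textract argument exists only so the handlers can be exercised without AWS access.

```python
import json
import os
from urllib.parse import unquote_plus


def _textract_client():
    import boto3  # deferred so the handlers can be tested with a stand-in client

    return boto3.client("textract")


def start_handler(event, context, textract=None):
    """First Lambda: triggered by an S3 upload, starts the Textract job."""
    textract = textract or _textract_client()
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # object keys in S3 events are URL-encoded
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
        NotificationChannel={
            "SNSTopicArn": os.environ["TEXTRACT_SNS_TOPIC_ARN"],  # hypothetical variable name
            "RoleArn": os.environ["TEXTRACT_ROLE_ARN"],  # hypothetical variable name
        },
    )
    return {"JobId": response["JobId"]}


def result_handler(event, context, textract=None):
    """Second Lambda: triggered by the SNS completion message, collects all blocks."""
    textract = textract or _textract_client()
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message["Status"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + message["Status"])

    blocks, token = [], None
    while True:
        kwargs = {"JobId": message["JobId"]}
        if token:
            kwargs["NextToken"] = token  # request the next page of results
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            break
    return {"JobId": message["JobId"], "BlockCount": len(blocks)}
```

Splitting the work this way means no function has to sit in a polling loop while Textract runs; the second function only starts once the job has finished.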

Please read this blog post

https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/

and check this repository

https://github.com/aws-samples/aws-step-functions-rpa

To process the JSON, you can use this sample as a reference https://github.com/aws-samples/amazon-textract-response-parser or use it directly as a library.

python -m pip install amazon-textract-response-parser
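Once the library is installed, turning the parsed tables into CSV text could look roughly like the sketch below. It assumes the Document class from the library's trp module, with its pages, tables, rows, cells and text attributes, and it assumes that the list of responses collected from get_document_analysis can be passed to Document as-is. The function names load_document and tables_to_csv are hypothetical, and each table is written to its own CSV string rather than being merged across pages.

```python
import csv
import io


def load_document(response_pages):
    """Wrap raw Textract responses (one dict, or a list of dicts) in a trp Document."""
    from trp import Document  # provided by amazon-textract-response-parser

    return Document(response_pages)


def tables_to_csv(doc):
    """Return a list of (page_number, table_number, csv_text), one entry per table."""
    results = []
    for page_number, page in enumerate(doc.pages, start=1):
        for table_number, table in enumerate(page.tables, start=1):
            buffer = io.StringIO()
            writer = csv.writer(buffer, lineterminator="\n")
            for row in table.rows:
                # Strip surrounding whitespace that the parser may leave on cell text
                writer.writerow([cell.text.strip() for cell in row.cells])
            results.append((page_number, table_number, buffer.getvalue()))
    return results
```

Each csv_text entry can then be written to a separate file, for example one named after the page and table numbers.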
