Parsing multipage tables into CSV files with AWS Textract

I'm new to AWS and am trying to use AWS Textract to parse the tables in a multi-page file into CSV files. I tried the example AWS provides on this page; however, with a multi-page file the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, because those cases require asynchronous processing, as you can see in the documentation here. The correct function to call is client.start_document_analysis, and once it has run you retrieve the results with client.get_document_analysis(JobId).

So I modified their example to use this logic instead of the client.analyze_document function. The modified code looks like this:

client = boto3.client('textract')

response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])

jobid=response['JobId']

jobstatus="IN_PROGRESS"
while jobstatus=="IN_PROGRESS":
    response=client.get_document_analysis(JobId=jobid)
    jobstatus=response['JobStatus']
    if jobstatus == "IN_PROGRESS": print("IN_PROGRESS")
    time.sleep(5)

But when I run it, I get the following error:

Traceback (most recent call last):
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
    main(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
    table_csv = get_table_csv_results(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
    response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
    api_params, operation_model, context=request_context)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
    api_params, operation_model)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel

This happens because the standard way to call start_document_analysis is with a file stored in S3, using this syntax:

    response = client.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        },
        FeatureTypes=["TABLES"])

However, if I do that, I break the command-line logic proposed in the AWS example:

python textract_python_table_parser.py file.pdf

The question is: how can I adapt the AWS example so that it handles multi-page files?

Consider using two different Lambda functions: one to call Textract and one to process the results.

Please read this post:

https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/

and check this repository:

https://github.com/aws-samples/aws-step-functions-rpa
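
If you just want to try the call-then-process split locally before setting up Lambdas, here is a minimal sketch. It assumes you are willing to stage the local file in an S3 bucket you own, since start_document_analysis only accepts DocumentLocation. The helper names start_table_job, collect_blocks and main, and the extra bucket argument, are made up for illustration; the boto3 calls used are upload_file, start_document_analysis and get_document_analysis.

```python
import os
import time


def start_table_job(s3, textract, file_name, bucket):
    """Upload a local file to S3 and start an asynchronous TABLES analysis."""
    key = os.path.basename(file_name)
    s3.upload_file(file_name, bucket, key)
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    return response["JobId"]


def collect_blocks(textract, job_id, delay=5):
    """Poll until the job finishes, then gather the Blocks from every result page."""
    response = textract.get_document_analysis(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        time.sleep(delay)
        response = textract.get_document_analysis(JobId=job_id)
    if response["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])
    blocks = list(response["Blocks"])
    # Results can be paginated; follow NextToken until it is no longer returned.
    while "NextToken" in response:
        response = textract.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response["Blocks"])
    return blocks


def main(argv):
    """Hypothetical entry point: script.py <file> <bucket> (the bucket is an assumption)."""
    import boto3  # imported here so the helpers above can be tried with stub clients

    file_name, bucket = argv[1], argv[2]
    textract = boto3.client("textract")
    job_id = start_table_job(boto3.client("s3"), textract, file_name, bucket)
    return collect_blocks(textract, job_id)
```

Calling main(sys.argv) at the bottom of the script keeps the command-line style of the original example, with the bucket name passed as a second argument. Note that the loop over NextToken matters for multi-page files: a single get_document_analysis call may return only part of the blocks.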

To process the JSON, you can use this sample as a reference https://github.com/aws-samples/amazon-textract-response-parser or use it directly as a library.

python -m pip install amazon-textract-response-parser
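
For reference, this is roughly the kind of work the response parser saves you. The sketch below does not use the library; it walks the raw Blocks with the standard library only, relying on the TABLE, CELL and WORD block types, the CHILD relationships, and the RowIndex and ColumnIndex fields in the Textract response. The function name tables_to_csv is made up for illustration, and merged cells and selection elements are deliberately ignored.

```python
import csv
import io


def tables_to_csv(blocks):
    """Return one CSV string per TABLE block found in a list of Textract Blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def children(block):
        # CHILD relationships list the Ids of the blocks nested under this one.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    if child_id in by_id:
                        yield by_id[child_id]

    results = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell in children(table):
            if cell["BlockType"] != "CELL":
                continue
            words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        # RowIndex and ColumnIndex are 1-based in the Textract response.
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        results.append(buffer.getvalue())
    return results
```

Pass it the full list of blocks gathered from every page of the get_document_analysis results. As far as I know, Textract reports the part of a table on each page as its own TABLE block, so a table that spans pages will come back as several CSV strings that you would have to concatenate yourself.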