AWS Textract form design best practices

I am currently redesigning documents and forms to make extraction with AWS Textract easier.

Do you have any experience or best practices to share?

Regards

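AWS Textract uses machine learning algorithms to extract data from forms and tables. In general, they don't provide any good practices to follow for form design. The idea is that it can extract the data regardless of the format.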
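To make the form extraction concrete, here is a minimal sketch of how the key-value pairs can be read out of a Textract response. It assumes the response dict comes from the AnalyzeDocument API with the FORMS feature enabled, and it relies on the documented block structure (KEY_VALUE_SET blocks linked through VALUE and CHILD relationships). The function name `extract_key_values` is my own, not part of any AWS SDK.

```python
def extract_key_values(response):
    """Map each detected form key to (value text, lowest confidence of the pair).

    `response` is assumed to be the dict returned by Textract AnalyzeDocument
    with FeatureTypes=["FORMS"].
    """
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_text(block):
        # Join the WORD children (and checkbox states) of a key or value block.
        parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = blocks.get(child_id, {})
                if child.get("BlockType") == "WORD":
                    parts.append(child["Text"])
                elif child.get("BlockType") == "SELECTION_ELEMENT":
                    parts.append(child.get("SelectionStatus", ""))
        return " ".join(parts)

    result = {}
    for block in blocks.values():
        if block.get("BlockType") != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        confidence = block.get("Confidence", 0.0)
        for rel in block.get("Relationships", []):
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value_block = blocks.get(value_id, {})
                value_text = child_text(value_block)
                confidence = min(confidence, value_block.get("Confidence", 0.0))
        result[child_text(block)] = (value_text, confidence)
    return result
```

The confidence stored here is simply the lower of the key and value block scores, which is one possible choice rather than anything Textract prescribes.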

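My suggestion is to do some manual testing. Look at what the most common problems are with the forms or documents you currently use. Check whether data is missing, inconsistent, or simply detected incorrectly, and try to fix those problems. Then repeat the same process with the new forms and see whether there is an improvement.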
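The testing loop described above can be partly automated. The sketch below compares what was extracted against values you typed in by hand for a test form, and sorts each field into missing, mismatched, low confidence, or ok. Everything here is an assumption for illustration: the function name `compare_extraction`, the input shapes, and the 80.0 confidence threshold, which you would tune for your own documents.

```python
def compare_extraction(expected, extracted, min_confidence=80.0):
    """Classify each expected field by how well it was extracted.

    expected:  {field name: true value}, entered by hand for a test form.
    extracted: {detected key: (detected value, confidence 0-100)}.
    min_confidence is an arbitrary illustrative threshold.
    """
    def norm(text):
        # Ignore case, colons, and extra whitespace when comparing.
        return " ".join(text.lower().replace(":", " ").split())

    found = {norm(key): pair for key, pair in extracted.items()}
    report = {"missing": [], "mismatched": [], "low_confidence": [], "ok": []}
    for field, want in expected.items():
        hit = found.get(norm(field))
        if hit is None:
            report["missing"].append(field)
            continue
        got, confidence = hit
        if norm(got) != norm(want):
            report["mismatched"].append(field)
        elif confidence < min_confidence:
            report["low_confidence"].append(field)
        else:
            report["ok"].append(field)
    return report
```

Running this for the old and the redesigned form on the same filled-in data gives two reports you can compare field by field.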

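Is improving Textract accuracy your only goal? If so, you are probably already aware of the problems. Use that knowledge.

In that case, it would be very helpful to know what improved.

Knowing what type of documents we are talking about would also help in giving a better answer, as would knowing which frameworks/generators you are using.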

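Here are some best practices recommended in the Amazon Textract Developer Guide under "Provide an Optimal Input Document":

The following is a list of a few ways you can optimize your input documents for better results.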

Ensure that your document text is in a language that Amazon Textract supports. Currently, Amazon Textract supports English, Spanish, German, Italian, French, and Portuguese.

Provide a high quality image, ideally at least 150 DPI.

If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPEG, and PNG), don't convert or downsample the document before uploading it to Amazon Textract.
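The format and resolution points above can be checked before uploading. This is only a rough sketch: it judges the format by file extension and estimates DPI from the pixel width and the physical page width in inches, both of which you supply yourself. The names `preflight`, `effective_dpi`, `SUPPORTED_SUFFIXES`, and `MIN_DPI` are hypothetical helpers, not part of Textract.

```python
from pathlib import Path

# Extensions for the formats named in the guide excerpt (PDF, TIFF, JPEG, PNG).
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}
MIN_DPI = 150  # the "ideally at least 150 DPI" figure quoted above


def effective_dpi(pixel_width, physical_width_inches):
    """Dots per inch implied by a scan's pixel width and the paper width."""
    return pixel_width / physical_width_inches


def preflight(filename, pixel_width, physical_width_inches):
    """Return a list of problems found; an empty list means no problem found."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append("unsupported file format")
    if effective_dpi(pixel_width, physical_width_inches) < MIN_DPI:
        problems.append("resolution below 150 DPI")
    return problems
```

For example, a US Letter page (8.5 inches wide) scanned at 2550 pixels wide works out to 300 DPI, while the same page at 612 pixels wide is only 72 DPI.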

For best results when extracting text from tables in documents, ensure that:

Tables in your document are visually separated from surrounding elements on the page. For example, the table isn't overlaid onto an image or complex pattern.

Text within the table is upright. For example, the text isn't rotated relative to other text on the page.

When extracting text from tables, you might see inconsistent results when:

Merged table cells that span multiple columns.

Tables with cells, rows, or columns that are different from other parts of the same table.
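If you want to spot these table conditions in your own test runs, a sketch like the following can flag them from the response. It assumes an AnalyzeDocument response with the TABLES feature enabled, and that cells carry the `RowIndex`, `ColumnIndex`, `RowSpan`, and `ColumnSpan` fields, with merged cells possibly reported as separate MERGED_CELL blocks. The function name `find_table_issues` and the shape of its result are my own choices.

```python
from collections import Counter


def find_table_issues(response):
    """Return {table id: details} for tables with merged cells or uneven rows.

    `response` is assumed to be the dict returned by Textract AnalyzeDocument
    with FeatureTypes=["TABLES"].
    """
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    issues = {}
    for table in blocks.values():
        if table.get("BlockType") != "TABLE":
            continue
        # Collect every block the table points at, whatever the relationship type.
        related = [
            blocks[block_id]
            for rel in table.get("Relationships", [])
            for block_id in rel["Ids"]
            if block_id in blocks
        ]
        cells = [b for b in related if b.get("BlockType") == "CELL"]
        # A cell counts as merged if it is a MERGED_CELL block or spans more
        # than one row or column.
        merged = sorted(
            (b["RowIndex"], b["ColumnIndex"])
            for b in related
            if b.get("BlockType") == "MERGED_CELL"
            or (
                b.get("BlockType") == "CELL"
                and (b.get("RowSpan", 1) > 1 or b.get("ColumnSpan", 1) > 1)
            )
        )
        # Rows with different numbers of cells suggest an irregular layout.
        cells_per_row = Counter(c["RowIndex"] for c in cells)
        uneven = len(set(cells_per_row.values())) > 1
        if merged or uneven:
            issues[table["Id"]] = {"merged_cells": merged, "uneven_rows": uneven}
    return issues
```

Tables that show up in the result are good candidates for a simpler layout in the redesigned form.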

I strongly recommend reviewing the developer guide.

amazon-web-services

amazon-textract