Azure Form Recognizer Not Finding Content with Python on Databricks

Question

I am running the following Python on Databricks, using the Azure Cognitive Services Form Recognizer library:

from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
credential = AzureKeyCredential("aaa6123af5b843a38044538d95584c3d")
endpoint = "https://myformrecognizr.cognitiveservices.azure.com/"

form_recognizer_client = FormRecognizerClient(endpoint, credential)

with open("/dbfs/mnt/lake/RAW/export/Picturehouse.pdf", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))

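The PDF looks like the following:

The library recognizes: Cell text: Item Cell text: Quantity Cell text: Seat Allocation Cell text: Subtotal Cell text: Adult Cell text: 1 Cell text: D-11 Cell text: 14.50

But it does not recognize the following text in the PDF: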

You can go straight to the screen by showing your e-ticket to an usher. Alternatively, you can collect your tickets at Box Office at least 15 minutes before the advertised start time of the film or event. You need your Booking Reference and/or payment card to help us find your booking. You can print this page by clicking the "Print This Page" link above.

Is this by design? Or am I missing something in my code?

Answer 1

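Unfortunately, this is by design. Form Recognizer works with pre-trained models that recognize key-value pairs, text, and tables in the file uploaded as input, along with the content inside those tables. Paragraph text and content between tables, or anywhere else in the file, can also be recognized.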
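Note that the loop in the question prints only `content.tables` and `content.selection_marks`. In azure-ai-formrecognizer 3.x, each page returned by `begin_recognize_content` also exposes a `lines` list whose items have a `text` attribute, so it may be worth checking whether the missing paragraph shows up there. The following is a minimal sketch under that assumption, not a confirmed fix: `collect_line_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without calling the service.

```python
from types import SimpleNamespace


def collect_line_text(form_pages):
    """Return a list of (page_number, text) pairs, one per recognized line.

    Assumes each page has `page_number` and `lines`, and each line has `text`,
    as FormPage and FormLine do in azure-ai-formrecognizer 3.x.
    """
    results = []
    for page in form_pages:
        # Treat a missing or empty `lines` attribute as "no lines on this page".
        for line in page.lines or []:
            results.append((page.page_number, line.text))
    return results


# Hypothetical stand-in data; with the real service, pass poller.result() instead.
sample_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(text="You can go straight to the screen"),
            SimpleNamespace(text="by showing your e-ticket to an usher."),
        ],
    ),
    SimpleNamespace(page_number=2, lines=None),
]

for page_number, text in collect_line_text(sample_pages):
    print("Page {}: {}".format(page_number, text))
```

With the code from the question, the equivalent call would be `collect_line_text(form_pages)` right after `form_pages = poller.result()`.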

For more details, see the following links:

https://www.drware.com/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/

https://www.youtube.com/watch?v=iBQO4QdUp6A&t=10s

https://github.com/tomweinandy/form_recognizer_demo

apache-spark

pyspark

azure-cognitive-services

azure-databricks