Azure Form Recognizer Not Finding Content with Python on Databricks
I am executing the following Python on Databricks using the associated Cognitive Services Form Recognizer library:
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential("aaa6123af5b843a38044538d95584c3d")
endpoint = "https://myformrecognizr.cognitiveservices.azure.com/"
form_recognizer_client = FormRecognizerClient(endpoint, credential)

with open("/dbfs/mnt/lake/RAW/export/Picturehouse.pdf", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))
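From the PDF form, the library recognizes the following:
Cell text: Item
Cell text: Quantity
Cell text: Seat Allocation
Cell text: Subtotal
Cell text: Adult
Cell text: 1
Cell text: D-11
Cell text: 14.50
However, it fails to recognize the following text in the PDF: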
You can go straight to the screen by showing your e-ticket to an
usher. Alternatively, you can collect your tickets at Box Office at
least 15 minutes before the advertised start time of the film or
event. You need your Booking Reference and/or payment card to help us
find your booking. You can print this page by clicking the "Print This
Page" link above.
Is this by design, or am I missing something in my code?
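Unfortunately, this is by design. Form Recognizer works from a pre-trained model that identifies the key-value pairs, text, and tables in a document, along with the content of the tables in the file uploaded as input. Even when the file has a large paragraph of text and table content in the middle or anywhere else, that content can be recognized.
For more details, see this link: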
https://www.drware.com/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/
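One thing that may be worth checking: the loop in the question only walks content.tables and content.selection_marks. As far as I know, in the 3.x azure-ai-formrecognizer SDK each FormPage returned by begin_recognize_content also carries a lines list of FormLine objects with a text attribute, so any paragraph text the service extracted would be there rather than in the tables. A minimal sketch under that assumption follows. The helper collect_page_text and the sample_pages stand-ins are hypothetical names made up for illustration, not part of the SDK.

```python
from types import SimpleNamespace


def collect_page_text(form_pages):
    """Return {page_number: [line text, ...]} for every page in the result.

    Assumes each page exposes `page_number` and `lines`, and each line
    exposes `text` (the shape FormPage and FormLine are assumed to have
    in azure-ai-formrecognizer 3.x).
    """
    text_by_page = {}
    for page in form_pages:
        # `lines` may be None or empty when no text lines were returned.
        lines = page.lines or []
        text_by_page[page.page_number] = [line.text for line in lines]
    return text_by_page


# Stand-in objects so the sketch runs without an Azure resource.
# With the real service, pass `poller.result()` from the question instead.
sample_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(text="You can go straight to the screen by showing your e-ticket to an"),
            SimpleNamespace(text="usher."),
        ],
    ),
    SimpleNamespace(page_number=2, lines=None),
]

for page_number, texts in collect_page_text(sample_pages).items():
    print("Page {}: {} line(s)".format(page_number, len(texts)))
    for text in texts:
        print("  " + text)
```

In the question's code this would be called as collect_page_text(form_pages) right after poller.result(). If the paragraph still does not appear, the text was not returned by the service for that page and the limitation described above applies.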