Extract data from PDFs at scale with form recognizer: HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL on Databricks
Extract data from PDFs at scale with form recognizer: HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL on Databricks
我正在尝试使用 Azure 表单识别器从 pdf 中大规模提取数据。我正在使用 code example at github
我输入的代码如下:
import pandas as pd
field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
blob_url = container_url + "/" + blob.name
poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
invoices = poller.result()
print("Scanning " + blob.name + "...")
for idx, invoice in enumerate(invoices):
single_df = pd.DataFrame(columns=field_list)
for field in field_list:
entry = invoice.fields.get(field)
if entry:
single_df[field] = [entry.value]
single_df['FileName'] = blob.name
df = df.append(single_df)
df = df.reset_index(drop=True)
df
但是,我不断收到以下错误:
HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL.
我的 URL 如下所示:
https://blobpretbiukblbdev.blob.core.windows.net/demo?sp=racwdl&st=2022-05-21T19:39:07Z&se=2022-05-22T03:39:07Z&sv=2020-08-04&sr=c&sig=XYhdecG2jKF8aNPPpkcP%2FCGVVRKYTFPrOQYdNDsASCA%3D/pdf1.pdf
注意:
密钥已重新生成,我刚刚将密钥留在其中,因为它会出现在我的代码中以供说明。
我哪里可能出错了?
如REST API supportive documentation, there is a need to specify the Content-Type. There is a need to set the public access to source via JSON file. Set the Content-Type to application/pdf
. To make this work, there is a need to install filetype
package using link
所述
pip install filetype
检查此 link 以更好地实现 REST API 到用户表单识别器 SDK。
我正在尝试使用 Azure 表单识别器从 pdf 中大规模提取数据。我正在使用 code example at github
我输入的代码如下:
import pandas as pd
field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
blob_url = container_url + "/" + blob.name
poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
invoices = poller.result()
print("Scanning " + blob.name + "...")
for idx, invoice in enumerate(invoices):
single_df = pd.DataFrame(columns=field_list)
for field in field_list:
entry = invoice.fields.get(field)
if entry:
single_df[field] = [entry.value]
single_df['FileName'] = blob.name
df = df.append(single_df)
df = df.reset_index(drop=True)
df
但是,我不断收到以下错误:
HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL.
我的 URL 如下所示:
https://blobpretbiukblbdev.blob.core.windows.net/demo?sp=racwdl&st=2022-05-21T19:39:07Z&se=2022-05-22T03:39:07Z&sv=2020-08-04&sr=c&sig=XYhdecG2jKF8aNPPpkcP%2FCGVVRKYTFPrOQYdNDsASCA%3D/pdf1.pdf
注意: 密钥已重新生成,我刚刚将密钥留在其中,因为它会出现在我的代码中以供说明。
我哪里可能出错了?
如REST API supportive documentation, there is a need to specify the Content-Type. There is a need to set the public access to source via JSON file. Set the Content-Type to application/pdf
. To make this work, there is a need to install filetype
package using link
pip install filetype
检查此 link 以更好地实现 REST API 到用户表单识别器 SDK。