Why is Google PDF DOCUMENT_TEXT_DETECTION API much slower than Google JPG DOCUMENT_TEXT_DETECTION API
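I noticed that Google Vision PDF OCR with DOCUMENT_TEXT_DETECTION takes about 15 seconds to detect the text on a single PDF page: https://cloud.google.com/vision/docs/pdf
However, if I submit the same PDF page as a JPG, detecting the text takes less than 3 seconds: https://cloud.google.com/vision/docs/detecting-fulltext
I used the C# code provided here: https://cloud.google.com/vision/docs/pdf#vision-pdf-detection-gcs-csharp
I noticed that the following line takes about 15 seconds to detect all the text in the PDF and save it to the gsBucket: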
operation.PollUntilCompleted();
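- My GsBucket is "Multi-Regional Storage" in the US
- I am also uploading from a US location
Is there anything else I can do to speed this up, or is this expected?
You may find the answer to your question in this Google Groups thread. To summarize: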
The offline batch API is not designed to take short running time as
the first priority. Instead, it aims to provide scheduling for a large
number of multi-page PDF/TIFF files according to quota limits. So
instead of sending PDF/TIFF files one by one and wait for each one to
succeed, the typical way to use it is to send as many PDF/TIFF files
as possible at one time or continuously, track each operation id to
get the final result of each PDF/TIFF processing.
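The pattern described in the quote (start many operations, then track every operation id) can be sketched roughly as follows. This is a minimal illustration, not the client library's own mechanism: `wait_for_operations` and the `get_operation` callable are hypothetical names, and the sketch assumes each operation was already started (for example through the REST method `files:asyncBatchAnnotate`) and that `get_operation` returns the long-running operation resource as a dict with a `done` field.

```python
import time


def wait_for_operations(operation_names, get_operation,
                        poll_interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll several long-running operations together until all are done.

    operation_names: operation ids returned when each PDF/TIFF was submitted.
    get_operation:   hypothetical callable that fetches one operation
                     resource (a dict) by name.
    Returns a dict mapping each operation name to its finished resource.
    """
    pending = set(operation_names)
    finished = {}
    deadline = time.monotonic() + timeout

    while pending:
        # Check every outstanding operation once per round.
        for name in sorted(pending):
            operation = get_operation(name)
            if operation.get("done"):
                finished[name] = operation
        pending.difference_update(finished)

        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{len(pending)} operation(s) still running")
        sleep(poll_interval)

    return finished
```

Because all files are submitted before any waiting starts, the total wall-clock time is closer to that of the slowest operation than to the sum of all of them.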
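The small-batch online processing feature mentioned in the comments does not appear to be available in the C# client library yet. A workaround is to call the REST API directly or to use a client library for a different language.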
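As a rough sketch of the REST workaround, the snippet below builds and sends a request to the synchronous `files:annotate` method using only the Python standard library. The function names are made up for illustration, and it assumes you already hold an OAuth access token (for example from `gcloud auth print-access-token`) and that the PDF is small enough for online processing, which is limited to a few pages per request.

```python
import base64
import json
import urllib.request

# Synchronous (online) file annotation endpoint of the Vision REST API.
FILES_ANNOTATE_URL = "https://vision.googleapis.com/v1/files:annotate"


def build_files_annotate_body(pdf_bytes, pages=(1,)):
    """Build the JSON body for one online DOCUMENT_TEXT_DETECTION request."""
    return {
        "requests": [
            {
                "inputConfig": {
                    # The PDF is sent inline, base64-encoded, not via GCS.
                    "content": base64.b64encode(pdf_bytes).decode("ascii"),
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "pages": list(pages),
            }
        ]
    }


def annotate_pdf_online(pdf_bytes, access_token, pages=(1,)):
    """POST the request and return the parsed JSON response.

    access_token is assumed to be a valid OAuth 2.0 bearer token.
    """
    body = json.dumps(build_files_annotate_body(pdf_bytes, pages)).encode("utf-8")
    request = urllib.request.Request(
        FILES_ANNOTATE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The result comes back in the HTTP response itself, so there is no operation to poll and no output written to a bucket. If the response follows the documented shape, the recognized text for each page should be under `responses[0]["responses"][i]["fullTextAnnotation"]["text"]`.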