Why GCP Vision API returns worse results in python than at its online demo

I have written a basic Python script that calls the GCP Vision API. My goal is to send it a product image and retrieve (via OCR) the text written on the box. I have a predefined list of brands, so I can search the text returned by the API and detect which brand it is.

My Python script is the following:

import io
from google.cloud import vision
from google.cloud.vision import types
import os
import cv2
import numpy as np

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "**************************"


def detect_text(file):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()

    with io.open(file, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    response = client.text_detection(image=image)
    texts = response.text_annotations
    print('Texts:')

    for text in texts:
        print('\n"{}"'.format(text.description))

        vertices = (['({},{})'.format(vertex.x, vertex.y)
                    for vertex in text.bounding_poly.vertices])

        print('bounds: {}'.format(','.join(vertices)))


file_name = "Image.jpg"
img = cv2.imread(file_name)

detect_text(file_name)

Currently, I am experimenting with the following product image (951×335 resolution):

Its brand is Acuvue.

The problem is the following. When I test the online demo of GCP Cloud Vision, I get the following text result for this image:

FOR ASTIGMATISM 1-DAY ACUVUE MOIST WITH LACREON™ 30 Lenses BRAND CONTACT LENSES UV BLOCKING

(The JSON result the demo returns includes all the words above, including the word Acuvue which is important to me, but the JSON is too long to post here.)

So the online demo detects the text on the product quite well, and in particular it accurately detects the word Acuvue (i.e. the brand). However, when I call the same API with the same image from my Python script, I get the following result:

Texts:

"1.DAY
FOR ASTIGMATISM
WITH
LACREONTM
MOIS
30 Lenses
BRAND CONTACT LENSES
UV BLOCKING
"
bounds: (221,101),(887,101),(887,284),(221,284)

"1.DAY"
bounds: (221,101),(312,101),(312,125),(221,125)

"FOR"
bounds: (622,107),(657,107),(657,119),(622,119)

"ASTIGMATISM"
bounds: (664,107),(788,107),(788,119),(664,119)

"WITH"
bounds: (614,136),(647,136),(647,145),(614,145)

"LACREONTM"
bounds: (600,151),(711,146),(712,161),(601,166)

"MOIS"
bounds: (378,162),(525,153),(528,200),(381,209)

"30"
bounds: (614,177),(629,178),(629,188),(614,187)

"Lenses"
bounds: (634,178),(677,180),(677,189),(634,187)

"BRAND"
bounds: (361,210),(418,210),(418,218),(361,218)

"CONTACT"
bounds: (427,209),(505,209),(505,218),(427,218)

"LENSES"
bounds: (514,209),(576,209),(576,218),(514,218)

"UV"
bounds: (805,274),(823,274),(823,284),(805,284)

"BLOCKING"
bounds: (827,276),(887,276),(887,284),(827,284)

But this does not detect the word "Acuvue" the way the demo does!

Why is this happening?

Can I fix something in my Python script to make it work properly?

From the docs:

The Vision API can detect and extract text from images. There are two annotation features that support OCR:

  • TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.

  • DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.

My expectation is that the web API actually uses the latter, and then filters the results based on confidence.

A DOCUMENT_TEXT_DETECTION response includes additional layout information, such as page, block, paragraph, word, and break information, along with confidence scores for each.

In any case, my expectation (and my experience) is that the latter method will "try harder" to find all the strings.
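As an illustration of that filtering idea, here is a minimal sketch in plain Python. The word list, the confidence values, and the 0.8 threshold are all made up for the example; the (text, confidence) pairs stand in for the word-level results you would pull out of a DOCUMENT_TEXT_DETECTION response:

```python
# Hypothetical (word, confidence) pairs, standing in for the word-level
# results of a DOCUMENT_TEXT_DETECTION response.
words = [
    ("ACUVUE", 0.97),
    ("MOIST", 0.91),
    ("LACREONTM", 0.55),  # the TM symbol often trips up OCR
]

# Keep only the words the OCR is reasonably sure about (threshold is arbitrary).
confident_words = [text for text, conf in words if conf >= 0.8]
print(confident_words)  # ['ACUVUE', 'MOIST']
```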

I don't think you are doing anything wrong. There are simply two parallel detection methods. One of them (DOCUMENT_TEXT_DETECTION) is more intensive, optimized for documents (probably for straightened, aligned, and evenly spaced lines), and supplies more information that some applications may not need.

So I suggest you modify your code following the Python example here.
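A minimal sketch of that change, reusing the structure of the script in the question (untested here, since it needs real credentials; `Image.jpg` is the questioner's file):

```python
import io
from google.cloud import vision
from google.cloud.vision import types


def detect_document_text(path):
    """Like detect_text(), but using DOCUMENT_TEXT_DETECTION."""
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    # The only real change: document_text_detection instead of text_detection.
    response = client.document_text_detection(image=image)

    # full_text_annotation.text holds the entire extracted string;
    # its pages/blocks/paragraphs/words carry the layout and confidence info.
    return response.full_text_annotation.text


print('Acuvue' in detect_document_text('Image.jpg'))
```

If the demo really does use the document-oriented path, the brand name should show up in that string.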

Finally, my guess is that the 242 you asked about is an escaped octal value corresponding to the UTF-8 character it found while trying to recognize the ™ symbol.

If you run the following snippet:

b = b"\342\204\242"
s = b.decode('utf8')
print(s)

you will be pleased to see that it prints ™.
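To see where those octal digits come from, note that ™ (U+2122) encodes to three bytes in UTF-8, and 242 is just the octal spelling of the last one:

```python
data = "\u2122".encode("utf8")       # the TM sign, UTF-8 encoded
print(data)                          # b'\xe2\x84\xa2'
print([oct(byte) for byte in data])  # ['0o342', '0o204', '0o242']
```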