Google Document AI gives different outputs for the same file

Question

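I am using the Document OCR API to extract text from a PDF file, but part of the result is inaccurate. I suspect the cause is the presence of some Chinese characters.

Below is an example I made up: I cropped part of the region where the extracted text was wrong and added some Chinese characters to reproduce the problem.

When I use the website version, the Chinese characters are not returned, but the remaining characters are correct.

When I use Python to extract the text, the Chinese characters are extracted correctly, but some of the remaining characters are wrong.

The actual string I get.

Is there a difference between the Document AI version on the website and the API? How can I get all characters correctly?

Update:

When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages of both lines are empty lists, so I first had to change it to page.blocks or page.paragraphs.)

Code: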

from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: documentai.Document.Page.Layout, document: documentai.Document) -> str:
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index is left unset (and reads as 0) when the
        # segment starts at the very beginning of the text.
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document

    document_pages = document.pages

    response_text = []
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following blocks:")
    for page in document_pages:
        # page.lines returned empty detected_languages here,
        # so iterate over page.blocks instead.
        for block in page.blocks:
            block_text = get_text(block.layout, document)
            confidence = block.layout.confidence
            # Drop a single trailing newline before storing the text
            stripped = block_text[:-1] if block_text.endswith('\n') else block_text
            response_text.append((stripped, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", block.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))

The language code seems to be wrong. Could that affect the result?
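To compare what each layout level reports, the detected language codes can be tallied per level. The following is only a minimal sketch: language_counts is a hypothetical helper, and the sample page is a stand-in built from SimpleNamespace objects rather than a real Document AI response. It assumes each element exposes detected_languages entries with a language_code field, as in the output printed above.

```python
from collections import Counter
from types import SimpleNamespace


def language_counts(page, levels=("blocks", "paragraphs", "lines")):
    """Count detected language codes for each layout level of one page."""
    counts = {}
    for level in levels:
        counter = Counter()
        for element in getattr(page, level, []):
            for lang in element.detected_languages:
                counter[lang.language_code] += 1
        counts[level] = dict(counter)
    return counts


def _element(*codes):
    # Stand-in for a block/paragraph/line; illustration data only.
    return SimpleNamespace(
        detected_languages=[
            SimpleNamespace(language_code=code, confidence=1.0) for code in codes
        ]
    )


# Made-up page that mimics the situation described in the question:
# blocks and paragraphs carry language codes, lines carry none.
sample_page = SimpleNamespace(
    blocks=[_element("en", "zh")],
    paragraphs=[_element("en"), _element("zh")],
    lines=[_element(), _element()],
)

sample_counts = language_counts(sample_page)
print(sample_counts)
```

With a real response, passing each entry of document.pages to the helper should show at which level the language information is missing.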

Answer 1

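Posting this as a Community Wiki for better visibility.

One of Document AI's features is OCR (Optical Character Recognition), which recognizes text in various kinds of files.

In this case, the OP received different outputs from the Try it function and from the Python client library.

Why do Try it and the Python library differ? It is hard to say, because both methods use the same API, documentai_v1beta3. It may be related to some modification of the PDF when it is uploaded to the Try it demo, a different endpoint, language or alphabet recognition, or something random.

When you use the Python client, you also get a confidence score for the recognized text. Below is an example from my tests:

However, the OP's recognition confidence is about 0.73, so wrong results are possible, and in this case the problem is visible. I don't think it can be improved through code. The quality of the PDF might make a difference (in the OP's example there are some dots that could affect recognition).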
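Code cannot raise the confidence, but it can at least surface the blocks that deserve a manual check. Here is a minimal sketch that works on the (text, confidence) tuples returned by get_lines_of_text in the question. The helper name flag_low_confidence, the 0.8 threshold, and the sample values are all assumptions chosen for illustration, not recommendations from Document AI.

```python
def flag_low_confidence(results, threshold=0.8):
    """Return the (text, confidence) pairs whose confidence is below threshold."""
    return [(text, conf) for text, conf in results if conf < threshold]


# Made-up sample values; real input would come from get_lines_of_text(...).
sample_results = [("Invoice 2021", 0.98), ("A1B2 中文", 0.73)]

flagged = flag_low_confidence(sample_results)
for text, conf in flagged:
    print(f"Check manually ({conf:.2f}): {text}")
```

A suitable threshold depends on the documents being processed, so it is worth tuning against a few files whose correct text is known.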
