Google Document AI giving different outputs for the same file

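I am using the Document OCR API to extract text from a PDF file, but part of the result is inaccurate. I suspect the cause is the presence of some Chinese characters.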


When I use the website version, I cannot get the Chinese characters, but the remaining characters are correct.



Is there a difference between the version of Document AI on the website and the API? How can I get all the characters correctly?


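When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines, detected_languages is an empty list for both lines, so I had to switch to page.blocks or page.paragraphs first.)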


from google.cloud import documentai_v1beta3 as documentai

project_id= 'secret-medium-xxxxxx'
location = 'us' # Format is 'us' or 'eu'
processor_id = 'abcdefg123456' #  Create processor in Cloud Console

# A regional endpoint must be set for any location other than 'us'.
opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index defaults to 0 when the segment starts at the beginning
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):

    # For a location other than 'us', the module-level client must be
    # created with the matching api_endpoint.

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document

    document_pages = document.pages

    response_text = []
    # For a full list of Document object attributes, please reference this page:

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
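
To check at which layout level languages are actually reported, a small helper can collect detected_languages for every level of a page. This is only a sketch, not part of the code above: the name languages_by_level is hypothetical, and it assumes each page exposes blocks, paragraphs, lines, and tokens whose elements carry a detected_languages list of objects with a language_code field, as the Document AI page type does. The demo_page object is a local stand-in, not a real API response.

```python
from collections import Counter
from types import SimpleNamespace as NS

# Layout levels of a Document AI page that carry `detected_languages`.
LEVELS = ("blocks", "paragraphs", "lines", "tokens")

def languages_by_level(page) -> dict:
    """Count detected language codes per layout level of one page.

    Any level missing on the object is treated as empty.
    """
    summary = {}
    for level in LEVELS:
        counts = Counter()
        for element in getattr(page, level, []):
            for lang in element.detected_languages:
                counts[lang.language_code] += 1
        summary[level] = dict(counts)
    return summary

# Stand-in page (hypothetical values) for a quick local check.
demo_page = NS(
    blocks=[NS(detected_languages=[NS(language_code="en"), NS(language_code="zh")])],
    paragraphs=[NS(detected_languages=[NS(language_code="en")])],
    lines=[NS(detected_languages=[]), NS(detected_languages=[])],
    tokens=[],
)
for level, counts in languages_by_level(demo_page).items():
    print(level, counts)
```

With a real response, calling languages_by_level(page) for each page in document.pages would show whether only the line level comes back empty.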


Posting this as Community Wiki for better visibility.

One of the features of DocumentAI is OCR (Optical Character Recognition), which recognizes text in various kinds of files.

In this scenario, the OP received different outputs from the Try it function and from the Client Libraries - Python.

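Why is there a discrepancy between Try it and the Python library? It's hard to say, as both methods use the same API, documentai_v1beta3. It might be related to some file modification when the PDF is uploaded to the Try it demo, to different endpoints, to language alphabet recognition, or to something random.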
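
One way to narrow down where the two outputs diverge is to diff the text copied from the Try it demo against the text returned by the client library. The sketch below uses Python's standard difflib; the function name diff_outputs and the sample strings are hypothetical placeholders, not actual processor output.

```python
import difflib

def diff_outputs(demo_text: str, api_text: str) -> list:
    """Return (operation, demo_fragment, api_fragment) for each differing span."""
    matcher = difflib.SequenceMatcher(None, demo_text, api_text, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, demo_text[i1:i2], api_text[j1:j2]))
    return differences

# Hypothetical strings standing in for the two real outputs.
demo_text = "Invoice 2021 total"
api_text = "Invoice 发票 2021 total"
for tag, demo_part, api_part in diff_outputs(demo_text, api_text):
    print(f"{tag}: demo={demo_part!r} api={api_part!r}")
```

Listing only the differing spans makes it easier to see whether the mismatch is limited to the Chinese characters or also affects the surrounding text.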

When you use the Python Client, you also get the confidence percentage of the text recognition. Below are examples from my tests:

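However, the OP's recognition confidence is about 0.73, so it may produce wrong results, and in this case that is a visible issue. I don't think it can be improved with code anyway. A better-quality PDF might make a difference (in the OP's example there are some dots that might affect recognition).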
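
Although code cannot raise the recognition quality itself, the confidence values can at least be used to flag text worth checking by hand. A minimal sketch, assuming a list of (text, confidence) tuples like the one returned by get_lines_of_text in the question; the helper name flag_low_confidence and the 0.8 threshold are arbitrary, illustrative choices.

```python
def flag_low_confidence(results, threshold: float = 0.8):
    """Return the (text, confidence) pairs whose confidence is below threshold."""
    return [(text, conf) for text, conf in results if conf < threshold]

# Hypothetical values; real ones come from the processor response.
sample = [("Total amount", 0.98), ("發票 No. 123", 0.73)]
for text, conf in flag_low_confidence(sample):
    print(f"Review manually ({conf:.2f}): {text}")
```

The threshold would need tuning against documents whose correct text is known.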