Document Layout Analysis for text extraction

Question

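I need to analyze the layout structure of different document types, such as pdf, doc, docx, odt, etc.

My task is: given a document, group the text into blocks and find the correct boundaries of each block.

I ran some tests with Apache Tika. It is a very good extractor, but it often messes up the order of the blocks. Let me explain what I mean by ORDER.

Apache Tika just extracts the text, so if my document has two columns, Tika extracts the entire text of the first column and then the text of the second column. That is usually fine, but sometimes the text in the first column is related to the text in the second, as in a table with row relationships.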

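So I have to take the position of each block into account, and the problems are:

Defining the block boundaries, which is hard: I need to work out whether a sentence starts a new block.
Defining the orientation: for example, given a table, the "sentence" should be a row, not a column.

So basically I have to deal with the layout structure here to understand the block boundaries correctly.

Here is a visual example:

A classic extractor returns: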

2019
2018
2017
2016
2015
2014
Oregon Arts Commission Individual Artist Fellowship...

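This is wrong (in my case) because the dates are related to the text on their right.

This task is a preparation step for further NLP analysis, so it matters: for example, when I need to recognize entities in the text (NER) and then identify the relations between them, working with the correct context is very important.

How can I extract the text of a document and assemble related pieces of text into the same block (that is, understand the document's layout structure)?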

Answer 1

For your example, tesseract was able to produce the desired output after configuring the page segmentation mode via the --psm flag. See the docs.

--psm 6 assumes a single uniform block of text.

Of course, tesseract works on images. You could try converting PDFs to images using pdf2image. For the .docx, .doc, and .odt formats, one option would be to use pywin32 to convert them to PDF first.
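
The PDF route described above could look roughly like the sketch below. It assumes the pytesseract wrapper is used to call tesseract and that both poppler (for pdf2image) and the tesseract binary are installed; the function names and the dpi value are illustrative choices, not something prescribed by either library.

```python
def tesseract_config(psm=6):
    """Build the config string for tesseract; --psm selects the page segmentation mode."""
    return f"--psm {psm}"


def ocr_pdf(pdf_path, psm=6, dpi=300):
    """OCR every page of a PDF and return one text string per page."""
    # Imported lazily so tesseract_config() works without the OCR stack installed.
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract  # requires the tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page
    config = tesseract_config(psm)
    return [pytesseract.image_to_string(page, config=config) for page in pages]
```

Calling ocr_pdf("document.pdf") (a placeholder path) would return the text of each page; other --psm values may work better for other layouts, so it is worth trying a few.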

Answer 2

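This is only a partial solution to your problem, but it may simplify the task at hand. This tool receives PDF files and converts them to text files. It runs very fast and can process large batches of files.

It creates one output text file for each PDF. Its advantage over other tools is that the output text is aligned according to its original layout.

For example, this is a resume with a complex layout:

Its output is the following text file: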

Christopher                         Summary
                                    Senior Web Developer specializing in front end development.
Morgan                              Experienced with all stages of the development cycle for
                                    dynamic web projects. Well-versed in numerous programming
                                    languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
                                    Strong background in project management and customer
                                    relations.


                                    Skill Highlights
                                        •   Project management          •   Creative design
                                        •   Strong decision maker       •   Innovative
                                        •   Complex problem             •   Service-focused
                                            solver


                                    Experience
Contact
                                    Web Developer - 09/2015 to 05/2019
Address:                            Luna Web Design, New York
177 Great Portland Street, London      • Cooperate with designers to create clean interfaces and
W5W 6PQ                                   simple, intuitive interactions and experiences.
                                       • Develop project concepts and maintain optimal
Phone:                                    workflow.
+44 (0)20 7666 8555
                                       • Work with senior developer to manage large, complex
                                          design projects for corporate clients.
Email:
                                       • Complete detailed programming and development tasks
christoper.m@gmail.com
                                          for front end public and internal websites as well as
                                          challenging back-end server code.
LinkedIn:
                                       • Carry out quality assurance tests to discover errors and
linkedin.com/christopher.morgan
                                          optimize usability.

Languages                           Education
Spanish – C2
                                    Bachelor of Science: Computer Information Systems - 2014
Chinese – A1
                                    Columbia University, NY
German – A2


Hobbies                             Certifications
                                    PHP Framework (certificate): Zend, Codeigniter, Symfony.
   •   Writing
                                    Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
   •   Sketching
                                    SQL, MySQL.
   •   Photography
   •   Design
-----------------------Page 1 End-----------------------

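Now your task is reduced to finding the blocks within a text file, using the spaces between the words as alignment hints. As a start, I include a script that finds the margin between the columns of text and yields rhs and lhs, the text streams of the right and left columns respectively.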

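import re

import matplotlib.pyplot as plt
import numpy as np

# txt holds the layout-preserving text produced by the tool;
# the file name below is a placeholder for your own output file
with open("output.txt", encoding="utf-8") as f:
    txt = f.read()

txt_lines = txt.split('\n')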
max_line_index = max([len(line) for line in txt_lines])
padded_txt_lines = [line + " " * (max_line_index - len(line)) for line in txt_lines] # pad short lines with spaces
space_idx_counters = np.zeros(max_line_index)

page_end = len(padded_txt_lines)  # default when no page-end marker is found
for idx, line in enumerate(padded_txt_lines):
    if "-----------------------Page" in line:  # reached end of page
        page_end = idx
        break
    space_idxs = [pos for pos, char in enumerate(line) if char == " "]
    space_idx_counters[space_idxs] += 1

padded_txt_lines = padded_txt_lines[:page_end]  # drop the page-end line and everything after it

# plot histogram of spaces in each character column
plt.bar(list(range(len(space_idx_counters))), space_idx_counters)
plt.title("Number of spaces in each column over all lines")
plt.show()

# find the separator column idx
separator_idx = np.argmax(space_idx_counters)
print(f"separator index: {separator_idx}")
left_lines = []
right_lines = []

# separate two columns of text
for line in padded_txt_lines:
    left_lines.append(line[:separator_idx])
    right_lines.append(line[separator_idx:])

# join each block into one stream of text, collapse runs of 4+ whitespace characters
lhs = ' '.join(left_lines)
lhs = re.sub(r"\s{4,}", " ", lhs)
rhs = ' '.join(right_lines)
rhs = re.sub(r"\s{4,}", " ", rhs)

print("************ Left Hand Side ************")
print(lhs)
print("************ Right Hand Side ************")
print(rhs)

Plot output:

Text output:

separator index: 33
************ Left Hand Side ************
Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: christoper.m@gmail.com LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies •   Writing •   Sketching •   Photography •   Design 
************ Right Hand Side ************
   Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights •   Project management •   Creative design •   Strong decision maker •   Innovative •   Complex problem •   Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL.

The next step is to generalize this script to handle multi-page documents, remove redundant symbols, and so on.
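
One possible way to start on the multi-page generalization is sketched below. It assumes every page ends with a marker line shaped like the "Page 1 End" line in the sample output, and it reuses the same "most frequent space column" heuristic per page. The function names are illustrative, and the sketch uses plain Python instead of NumPy to stay dependency-free.

```python
import re

# Assumption: each page ends with a line like "-----Page 1 End-----".
PAGE_END = re.compile(r"^-+Page \d+ End-+\s*$")


def split_pages(txt):
    """Split the tool's output into pages, each a list of lines."""
    pages, current = [], []
    for line in txt.split("\n"):
        if PAGE_END.match(line):
            pages.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # trailing text with no marker
        pages.append(current)
    return pages


def squeeze(text):
    """Collapse runs of 4+ whitespace characters and trim the ends."""
    return re.sub(r"\s{4,}", " ", text).strip()


def split_columns(lines):
    """Split one page into (lhs, rhs) at the character column that is most often a space."""
    width = max((len(line) for line in lines), default=0)
    if width == 0:
        return "", ""
    padded = [line.ljust(width) for line in lines]
    counts = [sum(line[i] == " " for line in padded) for i in range(width)]
    sep = counts.index(max(counts))  # first maximum, like np.argmax
    lhs = " ".join(line[:sep] for line in padded)
    rhs = " ".join(line[sep:] for line in padded)
    return squeeze(lhs), squeeze(rhs)
```

Running split_columns on each page returned by split_pages gives one (lhs, rhs) pair per page. The heuristic still assumes exactly two columns per page, so pages with a different layout would need their own handling.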

Good luck!

Answer 3

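You can use easyocr. It uses deep learning models to extract characters, and it returns the words together with their positions on the page. The steps are to convert your document to an image and then analyze it.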

#pip install -U easyocr
import easyocr

language = "en"

image_path = "https://i.stack.imgur.com/i6vHT.png"

reader = easyocr.Reader([language])
response = reader.readtext(image_path, detail=True)

print(response)

Here is the result for your example, ignoring the bounding box details:

The text comes out correctly.
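
The bounding boxes are what make it possible to restore the row relationships from the question. The sketch below assumes the result format that readtext returns with detail enabled, a list of (bbox, text, confidence) entries where bbox is four [x, y] corner points. The group_into_rows helper and its y_tolerance default are illustrative guesses, not part of easyocr.

```python
def group_into_rows(results, y_tolerance=10):
    """Group OCR results into rows of text, top to bottom and left to right.

    results: list of (bbox, text, confidence); bbox is four [x, y] corner points.
    y_tolerance: max vertical distance in pixels between a box center and the
                 first box of a row for both to count as the same row.
    """
    items = []
    for bbox, text, _confidence in results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        items.append((sum(ys) / len(ys), min(xs), text))
    items.sort()  # by vertical center, then by left edge

    rows = []
    for y_center, x_left, text in items:
        if rows and abs(y_center - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x_left, text))
        else:
            rows.append({"y": y_center, "cells": [(x_left, text)]})

    # Within each row, order the cells by their left edge.
    return [" ".join(text for _x, text in sorted(row["cells"])) for row in rows]
```

Passing the response from the snippet above to group_into_rows would yield one string per visual row, so a year in the left column ends up next to the text on its right. The tolerance depends on image resolution and font size and will likely need tuning.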

Answer 4

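Take a look at the Konfuzio documentation for text analysis and extraction. You can define your own models and access the data.

Even for documents with a 2-column layout, you can use Konfuzio to get the layout structure of the document. It segments the document into 5 classes: text, title, list, table and figure.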

# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init

from konfuzio_sdk.api import get_results_from_segmentation

result = get_results_from_segmentation(doc_id=1111, project_id=111)

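The result will contain the bounding boxes of the different elements in the document along with their respective classification. You can, for example, use the bounding box information of the elements to find which elements are in the same row.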
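
Finding elements in the same row could be sketched as follows. The exact shape of the segmentation result is not shown here, so this assumes, purely for illustration, that each element has been converted to a dict with x0, y0, x1, y1 keys where x0 < x1 and y0 < y1; both helper functions are hypothetical and not part of konfuzio_sdk.

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height that the two boxes share vertically."""
    shared = min(a["y1"], b["y1"]) - max(a["y0"], b["y0"])
    shorter = min(a["y1"] - a["y0"], b["y1"] - b["y0"])
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def same_row_pairs(elements, min_overlap=0.5):
    """Return (left, right) pairs of elements that sit side by side in the same row."""
    pairs = []
    for left in elements:
        for right in elements:
            is_left_of = left["x1"] <= right["x0"]
            if is_left_of and vertical_overlap(left, right) >= min_overlap:
                pairs.append((left, right))
    return pairs
```

Measuring overlap relative to the shorter box lets a one-line date match a multi-line paragraph next to it, which is the situation in the question. The 0.5 threshold is an arbitrary starting point.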

https://github.com/konfuzio-ai/document-ai-python-sdk/issues/7

Answer 5

You may want to check out the document layout parser on GitHub.
