Document Layout Analysis for text extraction
I need to analyze the layout structure of different document types, e.g. pdf, doc, docx, odt, etc.
My task is:
Given a document, group its text and find the correct boundaries of each block.
I ran some tests with Apache Tika. It is a very good extractor, but it often messes up the ORDER of the blocks. Let me explain what I mean by ORDER.
Apache Tika only extracts text, so if my document has two columns, Tika extracts the whole text of the first column and then the text of the second. That is fine... but sometimes the first column is related to the second, for example in a table whose cells are related row by row.
So I have to pay attention to the position of each block, and the problems are:
Defining the boundaries of a box, which is hard... I should understand whether a sentence starts a new block.
Defining the direction: for a table, for example, the "sentences" should be the rows, not the columns.
So basically I have to handle the layout structure here to understand the block boundaries correctly.
Let me give a visual example. A classic extractor returns:
2019
2018
2017
2016
2015
2014
Oregon Arts Commission Individual Artist Fellowship...
This is wrong (in my case) because each date is related to the text on its right.
This task is a preparation step for further NLP analysis, so it matters: for example, when I need to recognize entities in the text (NER) and then the relations between them, having the right context is essential.
How can I extract a document and assemble the related pieces of text under the same block (i.e., understand the document's layout structure)?
For your example, tesseract was able to produce the desired output after configuring the page segmentation mode via the --psm flag. See the docs:
--psm 6    Assume a single uniform block of text.
Of course, tesseract works on images. You can try using pdf2image to convert PDFs to images; for the .docx, .doc and .odt formats, one option would be to use pywin32 to convert them to PDF first.
This is only a partial solution to your problem, but it may simplify the task at hand.
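The pipeline above can be sketched as follows (assumptions: pdf2image, pytesseract and the underlying poppler/tesseract binaries are installed; the file name is illustrative):

```python
def pdf_to_text(pdf_path, psm=6):
    """OCR every page of a PDF with a fixed tesseract page segmentation mode."""
    # deferred imports: both are third-party packages
    from pdf2image import convert_from_path
    import pytesseract
    pages = convert_from_path(pdf_path, dpi=300)  # one PIL image per page
    return [pytesseract.image_to_string(page, config=f"--psm {psm}")
            for page in pages]
```

Usage would then be e.g. `pages_text = pdf_to_text("resume.pdf")`.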
This tool takes a PDF file and converts it into a text file. It runs very fast and can process large batches of files, creating one output text file per PDF. Its advantage over other tools is that the output text stays aligned according to the original layout.
For example, this is a résumé with a complex layout:
Its output is the following text file:
Christopher Summary
Senior Web Developer specializing in front end development.
Morgan Experienced with all stages of the development cycle for
dynamic web projects. Well-versed in numerous programming
languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
Strong background in project management and customer
relations.
Skill Highlights
• Project management • Creative design
• Strong decision maker • Innovative
• Complex problem • Service-focused
solver
Experience
Contact
Web Developer - 09/2015 to 05/2019
Address: Luna Web Design, New York
177 Great Portland Street, London • Cooperate with designers to create clean interfaces and
W5W 6PQ simple, intuitive interactions and experiences.
• Develop project concepts and maintain optimal
Phone: workflow.
+44 (0)20 7666 8555
• Work with senior developer to manage large, complex
design projects for corporate clients.
Email:
• Complete detailed programming and development tasks
christoper.m@gmail.com
for front end public and internal websites as well as
challenging back-end server code.
LinkedIn:
• Carry out quality assurance tests to discover errors and
linkedin.com/christopher.morgan
optimize usability.
Languages Education
Spanish – C2
Bachelor of Science: Computer Information Systems - 2014
Chinese – A1
Columbia University, NY
German – A2
Hobbies Certifications
PHP Framework (certificate): Zend, Codeigniter, Symfony.
• Writing
Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
• Sketching
SQL, MySQL.
• Photography
• Design
-----------------------Page 1 End-----------------------
Now your task is reduced to finding the bulks of text in this file, using the spaces between words as alignment hints.
To get you started, here is a script that finds the margin between the text columns and produces rhs and lhs — the text streams of the right and left columns, respectively.
import numpy as np
import matplotlib.pyplot as plt
import re

# txt holds the tool's text output, e.g.: txt = open("resume.txt").read()
txt_lines = txt.split('\n')
max_line_len = max([len(line) for line in txt_lines])
padded_txt_lines = [line + " " * (max_line_len - len(line)) for line in txt_lines]  # pad short lines with spaces

space_idx_counters = np.zeros(max_line_len)
for idx, line in enumerate(padded_txt_lines):
    if line.find("-----------------------Page") >= 0:  # reached end of page
        break
    space_idxs = [pos for pos, char in enumerate(line) if char == " "]
    space_idx_counters[space_idxs] += 1
padded_txt_lines = padded_txt_lines[:idx]  # remove the end-of-page line

# plot histogram of spaces in each character column
plt.bar(list(range(len(space_idx_counters))), space_idx_counters)
plt.title("Number of spaces in each column over all lines")
plt.show()

# find the separator column idx
separator_idx = np.argmax(space_idx_counters)
print(f"separator index: {separator_idx}")

left_lines = []
right_lines = []
# separate the two columns of text
for line in padded_txt_lines:
    left_lines.append(line[:separator_idx])
    right_lines.append(line[separator_idx:])

# join each bulk into one stream of text, remove redundant spaces
lhs = ' '.join(left_lines)
lhs = re.sub(r"\s{4,}", " ", lhs)
rhs = ' '.join(right_lines)
rhs = re.sub(r"\s{4,}", " ", rhs)
print("************ Left Hand Side ************")
print(lhs)
print("************ Right Hand Side ************")
print(rhs)
Plot output:
Text output:
separator index: 33
************ Left Hand Side ************
Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: christoper.m@gmail.com LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies • Writing • Sketching • Photography • Design
************ Right Hand Side ************
Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights • Project management • Creative design • Strong decision maker • Innovative • Complex problem • Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL.
The next step would be to generalize this script to handle multi-page documents, remove redundant symbols, and so on.
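The multi-page generalization can be sketched by splitting the tool's output on its page markers and running the column split per page (the marker format is assumed from the single-page example above):

```python
import re
import numpy as np

def split_pages(txt):
    """Split the tool's output into pages on its '...Page N End...' markers."""
    pages = re.split(r"-+Page \d+ End-+", txt)
    return [p for p in pages if p.strip()]

def split_two_columns(page_text):
    """Split one page of layout-aligned text into left/right column streams,
    using the character column that is blank most often as the separator."""
    lines = page_text.strip("\n").split("\n")
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    space_counts = np.zeros(width)
    for line in padded:
        for pos, ch in enumerate(line):
            if ch == " ":
                space_counts[pos] += 1
    sep = int(np.argmax(space_counts))
    lhs = re.sub(r"\s{4,}", " ", " ".join(line[:sep] for line in padded)).strip()
    rhs = re.sub(r"\s{4,}", " ", " ".join(line[sep:] for line in padded)).strip()
    return lhs, rhs
```

Then `for page in split_pages(txt): lhs, rhs = split_two_columns(page)` processes the whole document.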
Good luck!
You can use easyocr. It uses deep learning models to extract the characters, and it returns the words together with their positions on the page. The steps are: convert your document to images, then run the analysis.
# pip install -U easyocr
import easyocr
language = "en"
image_path = "https://i.stack.imgur.com/i6vHT.png"
reader = easyocr.Reader([language])
response = reader.readtext(image_path, detail=True)
print(response)
Here is an example where we ignore the bounding-box details; the text comes out correctly.
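To recover a reading order from those positions, one option is to bucket the results into rows by their top coordinate and sort each row left to right. A minimal sketch, assuming each result is a (bbox, text, confidence) tuple with bbox given as four [x, y] corner points (easyocr's detail=True output), and a hypothetical row tolerance:

```python
def reading_order(results, row_tol=10):
    """Sort OCR results top-to-bottom, then left-to-right within a row."""
    def key(result):
        box = result[0]  # four [x, y] corner points
        top = min(p[1] for p in box)
        left = min(p[0] for p in box)
        return (top // row_tol, left)  # bucket tops into rows of height row_tol
    return sorted(results, key=key)
```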
Check out the Konfuzio documentation for text analysis and extraction. You can define your own models and access the data.
Even for documents with a 2-column layout, you can get the layout structure with Konfuzio.
It segments a document into 5 classes: text, title, list, table and figure.
# pip install konfuzio_sdk
# in working directory
# konfuzio_sdk init
from konfuzio_sdk.api import get_results_from_segmentation
result = get_results_from_segmentation(doc_id=1111, project_id=111)
The result will contain the bounding boxes of the different elements in the document and their respective classifications.
You can, for example, use the bounding-box information of the elements to find out which elements are in the same row.
https://github.com/konfuzio-ai/document-ai-python-sdk/issues/7
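That same-row lookup can be sketched in plain Python, independently of Konfuzio (the (x0, y0, x1, y1) box format and the tolerance value are assumptions, not the SDK's actual output schema):

```python
def group_into_rows(boxes, tol=5):
    """Group (x0, y0, x1, y1) boxes into rows: a box joins the current row
    when its vertical center is within tol of that row's center."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        cy = (box[1] + box[3]) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= tol:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"cy": cy, "boxes": [box]})
    # order boxes left-to-right inside each row
    return [sorted(r["boxes"], key=lambda b: b[0]) for r in rows]
```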
You might also want to check out document layout parsers on GitHub.