PDFMiner Extraction for Single Words - LTText LTTextBox
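I'm using PDFMiner in the example below to get x,y coordinates for words, but the results come out line by line. How can I split out each word individually, instead of getting groups of words per line (see the examples below)? I have tried several of the parameters from the PDFMiner tutorial, and I tried both LTTextBox and LTText. Also, I can't use the start and end offsets that are normally used in text analysis.
This PDF is a good example, and it is the one used in the code below.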
http://www.africau.edu/images/default/sample.pdf
from pdfminer.layout import LAParams, LTTextBox, LTText
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Reads a searchable PDF and prints x,y coordinates
fp = open(r'C:\sample.pdf', 'rb')  # raw string so the backslash is taken literally
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    for lobj in layout:
        if isinstance(lobj, LTText):
            # bbox is (x0, y0, x1, y1); use the left edge and the top edge
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))
This returns the x,y coordinates for the searchable PDF as shown below:
--- Processing ---
At (57.375, 747.903) is text: A Simple PDF File
At (69.25, 698.098) is text: This is a small demonstration .pdf file -
At (69.25, 674.194) is text: just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
Desired result (the coordinates are for demonstration only):
--- Processing ---
At (57.375, 747.903) is text: A
At (69.25, 698.098) is text: Simple
At (69.25, 674.194) is text: PDF
At (69.25, 638.338) is text: File
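With PDFMiner, after iterating over each line (as you already do), you can only iterate over each character in the line.
I did that with the code below, recording the x,y of the first character of each word and setting a condition to split words at every LTAnno (e.g. \n) or whenever .get_text() == ' ' (a space).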
from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Reads a searchable PDF and prints x,y coordinates
fp = open(r'C:\sample.pdf', 'rb')  # raw string so the backslash is taken literally
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    # x == -1 means "no word in progress"
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
            for line in textbox:
                for char in line:
                    # If the char is a line-break or an empty space, the word is complete
                    if isinstance(char, LTAnno) or char.get_text() == ' ':
                        if x != -1:
                            print('At %r is text: %s' % ((x, y), text))
                        x, y, text = -1, -1, ''
                    elif isinstance(char, LTChar):
                        text += char.get_text()
                        # Remember the position of the first character of the word
                        if x == -1:
                            x, y = char.bbox[0], char.bbox[3]
    # If the last symbol on the page was neither an empty space nor a LTAnno, print the word here
    if x != -1:
        print('At %r is text: %s' % ((x, y), text))
The output looks like this:
At (64.881, 747.903) is text: A
At (90.396, 747.903) is text: Simple
At (180.414, 747.903) is text: PDF
At (241.92, 747.903) is text: File
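The splitting rule used in the character loop above can also be written as a small standalone function, which makes it easier to test without a PDF at hand. The following is only a sketch under a few assumptions: group_words is a hypothetical helper name, each character is passed in as a (text, bbox) pair, a bbox of None stands in for an LTAnno, and the sample coordinates are made up for illustration. Unlike the loop above, it keeps the full bounding box of each word rather than only the position of its first character.

```python
from typing import Iterable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def group_words(chars: Iterable[Tuple[str, Optional[BBox]]]) -> List[Tuple[str, BBox]]:
    """Group (text, bbox) pairs into (word, bbox) pairs.

    A pair whose bbox is None plays the role of an LTAnno, and a pair whose
    text is whitespace plays the role of a space. Both end the current word.
    """
    words: List[Tuple[str, BBox]] = []
    text, box = '', None
    for ch, bbox in chars:
        if bbox is None or ch.isspace():
            if text:
                words.append((text, box))
            text, box = '', None
            continue
        text += ch
        if box is None:
            box = bbox
        else:
            # Grow the word box so that it covers the new character too
            box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3]))
    if text:
        # The input ended without a trailing space or annotation
        words.append((text, box))
    return words


# Made-up coordinates, for illustration only
sample = [
    ('A', (64.0, 730.0, 80.0, 748.0)),
    (' ', (80.0, 730.0, 90.0, 748.0)),
    ('P', (90.0, 730.0, 105.0, 748.0)),
    ('D', (105.0, 730.0, 120.0, 748.0)),
    ('F', (120.0, 730.0, 135.0, 748.0)),
    ('\n', None),
]
for word, bbox in group_words(sample):
    print(word, bbox)
```

With PDFMiner, the pairs could be produced from a line as (c.get_text(), c.bbox if isinstance(c, LTChar) else None) for each c in the line; bbox[0] and bbox[3] of each result then match the x,y printed above.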
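You may want to refine the conditions that detect a word to suit your requirements and preferences (for example, trimming the punctuation marks .!? from the end of a word).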
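As one possible refinement of that kind, trailing punctuation can be trimmed from each word before it is printed. This is a minimal sketch, not part of PDFMiner: strip_trailing_punctuation is a hypothetical helper name, and the default set of characters is just the .!? mentioned above.

```python
def strip_trailing_punctuation(word: str, chars: str = '.!?') -> str:
    """Remove any run of the given characters from the end of the word."""
    return word.rstrip(chars)


for raw in ['tutorials.', 'text.', '.pdf', 'What?!', 'file']:
    print(repr(raw), '->', repr(strip_trailing_punctuation(raw)))
```

Because only the end of the word is trimmed, the x,y of the first character stays valid. A word that consists solely of punctuation becomes an empty string, so the caller may want to skip empty results.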
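Another package readers might want to try is pdfparser, which is built on Poppler (using Cython bindings) and happens to be considerably better optimized for performance: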
                             pdfreader   pdfminer   speed-up factor
tiny document (half page)    0.033s      0.121s     3.6 x
small document (5 pages)     0.141s      0.810s     5.7 x
medium document (55 pages)   1.166s      10.524s    9.0 x
large document (436 pages)   10.581s     108.095s   10.2 x
Besides being faster, it also has better error handling and resolves several issues that Pdfminer runs into.
import pdfparser.poppler as pdf
import sys

d=pdf.Document(sys.argv[1])

print('No of pages', d.no_of_pages)
for p in d:
    print('Page', p.page_no, 'size =', p.size)
    for f in p:
        print(' '*1,'Flow')
        for b in f:
            print(' '*2,'Block', 'bbox=', b.bbox.as_tuple())
            for l in b:
                print(' '*3, l.text.encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)'% l.bbox.as_tuple())
                #assert l.char_fonts.comp_ratio < 1.0
                for i in range(len(l.text)):
                    print(l.text[i].encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)'% l.char_bboxes[i].as_tuple(),\
                        l.char_fonts[i].name, l.char_fonts[i].size, l.char_fonts[i].color,)
                print()
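As you can clearly see, the source code is very short, yet it still provides all the necessary data, including font color, font size, and font family.
More importantly, you can get words directly in the block (just one level above characters), which avoids the space-checking logic that has to be used with Pdfminer.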