Extracting tables from a PDF

Question

I'm trying to get data out of a table in a PDF. I've tried pdfminer and pypdf with a little luck, but I can't really get the data out of the tables.

This is what one of the tables looks like:

As you can see, some columns are marked with an 'x'. I'm trying to get this table into a list of objects.
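One possible shape for that list of objects, as a minimal Python 3 sketch: each table row becomes a dict keyed by header, and a cell holding an 'x' becomes True. The header names, the sample rows, and the `rows_to_records` helper are hypothetical, since the real column names are not given here.

```python
def rows_to_records(headers, rows, mark="x"):
    """Turn table rows into a list of dicts, one per row.

    The first column is kept as text; every other column becomes True
    when its cell holds the mark and False otherwise.
    """
    records = []
    for row in rows:
        record = {headers[0]: row[0].strip()}
        for name, cell in zip(headers[1:], row[1:]):
            record[name] = cell.strip().lower() == mark
        records.append(record)
    return records


# Hypothetical headers and rows, for illustration only.
headers = ["Tool", "Feature A", "Feature B"]
rows = [["Hammer", "x", ""], ["Wrench", "", "x"]]
records = rows_to_records(headers, rows)
```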

This is the code so far; I'm currently using pdfminer.

# pdfminer test
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFPageAggregator
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
from pdfminer.image import ImageWriter
from cStringIO import StringIO
import sys
import os


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        # process page
        interpreter.process_page(page)

        # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
        lines = retstr.getvalue().splitlines()

        # reset the buffer so the next page does not re-read this page's text
        retstr.seek(0)
        retstr.truncate()

        idx = containsSubString(lines, 'Tool')
        lines = lines[idx+1:]
        idx = containsSubString(lines, "1 The 'All'")
        # keep every line when the end marker is missing (idx == -1)
        if idx != -1:
            lines = lines[:idx]

        for line in lines:
            records.append(line)

    fp.close()
    device.close()
    retstr.close()

    return records


def containsSubString(lines, substring):
    # return the index of the first item containing substring, or -1 if none does
    for i, s in enumerate(lines):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
# splitlines() strips line endings, so add one back per record
with open(ft, 'w') as outFile:
    for line in text:
        outFile.write(line + '\n')

This generates a text file containing all of the text, but the spacing of the x's is not preserved. The output looks like this:

The x's in the text document are only single-spaced
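Plain text output drops the horizontal position of each 'x', and that position is what ties a mark to its column. One possible workaround, sketched below in Python 3, is to keep an x-coordinate for every text fragment and bucket the fragments into columns by position. pdfminer's layout objects carry a bounding box (for example an `x0` attribute) that could supply such coordinates; the fragment tuples, the column edges, and the `assign_columns` helper are assumptions made up for illustration.

```python
from bisect import bisect_right


def assign_columns(fragments, column_edges):
    """Place (x0, text) fragments into columns by their left x-coordinate.

    column_edges holds the left edge of each column in ascending order;
    a fragment goes into the last column whose edge is <= its x0.
    """
    cells = [""] * len(column_edges)
    for x0, text in sorted(fragments):
        index = max(bisect_right(column_edges, x0) - 1, 0)
        cells[index] = (cells[index] + " " + text).strip()
    return cells


# Hypothetical coordinates for one table row.
column_edges = [50, 200, 260, 320]
fragments = [(52.0, "Hammer"), (265.5, "x"), (324.0, "x")]
row = assign_columns(fragments, column_edges)
```

Empty strings in the result mark columns with no 'x', so the row keeps its full width even when most cells are blank.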

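Right now I'm only producing text output, but my goal is to generate an HTML document from the table data. I've been looking for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the result I'm after.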
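For the HTML step, a minimal Python 3 sketch might look like the following, assuming the table has already been parsed into a list of dicts. The `records_to_html` helper and the sample data are hypothetical; only `html.escape` comes from the standard library.

```python
from html import escape


def records_to_html(headers, records):
    """Render a list of dicts as an HTML table, escaping every cell."""
    lines = ["<table>", "  <tr>"]
    lines += ["    <th>%s</th>" % escape(name) for name in headers]
    lines.append("  </tr>")
    for record in records:
        lines.append("  <tr>")
        for name in headers:
            value = record.get(name, "")
            # Show booleans the way the PDF does: an 'x' or an empty cell.
            if isinstance(value, bool):
                value = "x" if value else ""
            lines.append("    <td>%s</td>" % escape(str(value)))
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical data, for illustration only.
html_headers = ["Tool", "Feature A"]
html_table = records_to_html(
    html_headers, [{"Tool": "Hammer & nails", "Feature A": True}]
)
```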

Edit: There will be multiple PDFs like this, and I need to get the table data from each of them. The headers are the same for all of the PDFs (as far as I know).

Answer 1

Try Tabula, and if it works, use the tabula-extractor library (written in Ruby) to extract the data programmatically.

Answer 2

I figured it out: I was going in the wrong direction. What I did was create a PNG of each table in the PDF, and I'm now processing the images with OpenCV and Python.
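The answer does not show its image-processing code. As a rough illustration of one step such a pipeline might include, the Python 3 sketch below locates table grid lines in a binarized image with projection profiles: a row or column made up almost entirely of dark pixels is treated as a ruling line. It uses plain nested lists rather than OpenCV arrays so that it runs without dependencies; the `find_lines` helper, the 0.9 threshold, and the toy image are all assumptions, not the author's method.

```python
def find_lines(image, threshold=0.9):
    """Return (row_indices, column_indices) of likely grid lines.

    image is a list of rows of 0/1 values, where 1 means a dark pixel.
    A row or column counts as a line when the fraction of dark pixels
    in it is at least `threshold`.
    """
    height = len(image)
    width = len(image[0])
    rows = [y for y, row in enumerate(image)
            if sum(row) >= threshold * width]
    cols = [x for x in range(width)
            if sum(image[y][x] for y in range(height)) >= threshold * height]
    return rows, cols


# A 5x7 toy "table": a border plus one vertical divider at x == 3.
toy = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
grid_rows, grid_cols = find_lines(toy)
```

Once the line positions are known, the rectangles between neighbouring lines are the cells, and each cell can be checked for ink to decide whether it holds an 'x'.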

Tags: python, ocr, pdf-parsing, python-2.7, pdfminer