Obtaining data from a PDF file with the same layout as with a copy+paste

Question

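I have a procedure that I would like to automate, which involves getting a series of tables from a PDF file. At the moment I can do it by opening the file in any viewer (Adobe, Sumatra, Okular, etc.), pressing Ctrl+A and Ctrl+C, and then Ctrl+V into Notepad. This keeps each row aligned in a reasonable enough format that I can then run a regex over it and copy and paste the result into Excel for whatever is needed afterwards.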
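
For reference, the manual post-processing step described above could be scripted roughly as follows. This is only a sketch under one assumption that is not stated in the question: that columns in the pasted text are separated by two or more whitespace characters, while single spaces occur only inside a cell. The names `text_to_rows` and `rows_to_csv`, and the sample text, are made up for illustration.

```python
import csv
import io
import re

# Assumption: a run of two or more whitespace characters marks a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(pasted_text):
    """Split each non-empty line of pasted text into a list of cells."""
    rows = []
    for line in pasted_text.splitlines():
        line = line.strip()
        if line:
            rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows, delimiter=";"):
    """Serialize the rows as delimiter-separated text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical text as it might look after pasting into Notepad.
sample = (
    "Item name      Qty    Price\n"
    "Blue widget     12     3.50\n"
    "\n"
    "Red gadget       4    10.00\n"
)
rows = text_to_rows(sample)
print(rows_to_csv(rows))
```

If cells can themselves contain runs of spaces, a fixed-width split on character offsets would probably be a safer choice than this regex.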

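When trying to do this with Python, I tried various modules, PDFMiner being the main one; it works by using this example, for instance. But it returns the data in a single column. Other options include just getting it as an HTML table, but in that case it adds extra splits mid-table, which make the parsing more complex, and it even occasionally switches columns between the first and second page.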

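I have a makeshift solution for now, but I am worried that I am reinventing the wheel, either because I am just missing a core option in the parser or because there are some fundamental aspects of how a PDF renderer works that I need to take into account to solve this.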

Any ideas on how to approach it?

Answer 1

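I ended up implementing a solution based on this one, itself modified from code by tgray. So far it has worked consistently in all the cases I have tested, but I have yet to determine how to manipulate pdfminer's parameters directly to obtain the desired behavior.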
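
One avenue that may be worth trying on the parameter side is pdfminer.six's `LAParams`. The sketch below is an untested guess for these particular files, not a confirmed fix: it assumes a recent pdfminer.six release (one that provides `pdfminer.high_level.extract_text` and accepts `boxes_flow=None`), and the wrapper name `extract_text_keep_rows` and the default `char_margin=50.0` are illustrative choices only. The idea is that a large `char_margin` lets characters that are far apart horizontally still be grouped into the same text line, and `boxes_flow=None` turns off the reordering of text boxes.

```python
def extract_text_keep_rows(pdf_path, char_margin=50.0, boxes_flow=None):
    """Extract text while trying to keep each table row on one output line.

    Hypothetical helper: the parameter values are starting points to tune.
    """
    # Imported lazily so pdfminer.six is only required when the function is called.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # char_margin is relative to the character width; larger values merge
    # more distant characters into the same line.
    laparams = LAParams(char_margin=char_margin, boxes_flow=boxes_flow)
    return extract_text(pdf_path, laparams=laparams)
```

Calling `extract_text_keep_rows('your_file.pdf')` returns a single string; whether the rows come out aligned well enough to parse depends on the PDF.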

Answer 2

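Posting this just to have a piece of code that works with py35 for CSV-like parsing. The splitting into columns is the simplest possible, but it worked for me.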

Kudos to tgray in this answer as a starting point.

I also threw in openpyxl, since I prefer to have the results directly in Excel.

# works with py35 & pip-installed pdfminer.six in 2017
def pdf_to_csv(filename):
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams, LTChar
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

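        def end_page(self, page):
            # Overriding end_page skips the base class's layout analysis, so
            # the characters on the page are still individual LTChar objects.
            from collections import defaultdict
            lines = defaultdict(dict)
            for child in self.cur_item._objs:
                if isinstance(child, LTChar):
                    # bbox is (x0, y0, x1, y1); keep the right and top edges
                    (_, _, x, y) = child.bbox
                    # negate y so that ascending order runs top to bottom
                    line = lines[int(-y)]
                    line[x] = child.get_text()
                    # the line is now an unsorted dict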

            for y in sorted(lines.keys()):
                line = lines[y]
                # combine close letters to form columns
                xpos = tuple(sorted(line.keys()))
                new_line = []
                temp_text = ''
                for i in range(len(xpos)-1):
                    temp_text += line[xpos[i]]
                    if xpos[i+1] - xpos[i] > 8:
                        # the 8 is representing font-width
                        # needs adjustment for your specific pdf
                        new_line.append(temp_text)
                        temp_text = ''
                # adding the last column which also manually needs the last letter
                new_line.append(temp_text+line[xpos[-1]])

                self.outfp.write(";".join(new_line))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())

    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(PDFPage.get_pages(fp,
                                pagenos, maxpages=maxpages,
                                password=password,caching=caching,
                                check_extractable=True)):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

fn = 'your_file.pdf'
result = pdf_to_csv(fn)

lines = result.split('\n')
import openpyxl as pxl
wb = pxl.Workbook()
ws = wb.active
for line in lines:
    ws.append(line.split(';'))
    # appending a list gives a complete row in xlsx
wb.save('your_file.xlsx')
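
To make the column-splitting rule easier to tune, here is a small standalone sketch of the same idea on made-up data, with no PDF involved. It mirrors the logic in `end_page` above (a new cell starts when the x positions of two neighbouring characters differ by more than a threshold), but the function name `split_columns`, the `max_gap` parameter and the sample coordinates are illustrative assumptions, not part of pdfminer.

```python
def split_columns(chars, max_gap=8.0):
    """Group (x, text) pairs into cells.

    A new cell starts whenever the x positions of two neighbouring
    characters differ by more than max_gap.
    """
    cells = []
    current = ""
    previous_x = None
    for x, text in sorted(chars):
        if previous_x is not None and x - previous_x > max_gap:
            cells.append(current)
            current = ""
        current += text
        previous_x = x
    if current:
        cells.append(current)
    return cells


# Hypothetical characters of one text line: (right edge in points, glyph).
line = [(10, "A"), (16, "B"), (40, "1"), (46, "2"), (52, "3"), (90, "x")]
print(split_columns(line))              # ['AB', '123', 'x']
print(split_columns(line, max_gap=30))  # ['AB123', 'x']
```

Because the positions compared are the right edges of consecutive characters, the threshold has to be larger than one character advance, which is why the value needs adjusting to the font size of each PDF.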


python

pdf

pdfminer