Extract first two lines of PDF with Python and pyPDF

Question

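I am using Python 2.7 and pyPDF to get the title meta information from PDF files. Unfortunately, not all PDFs have meta information. What I want to do now is grab the first two lines of text from the PDF. Using what I have now, how can I modify the code to capture the first two lines with pyPDF?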

from pyPdf import PdfFileWriter, PdfFileReader
import os

for fileName in os.listdir('.'):
    try:
        if fileName.lower()[-3:] != "pdf": continue
        input1 = PdfFileReader(file(fileName, "rb"))

        # print the title of document1.pdf
        print fileName, input1.getDocumentInfo().title
    except:
        print ",",

Answer 1

from PyPDF2 import PdfFileReader
import StringIO

fileName = "HMM.pdf"
try:
    if fileName.lower()[-3:] == "pdf":
        input1 = PdfFileReader(file(fileName, "rb"))

        # print the title of document1.pdf
        #print fileName, input1.getDocumentInfo().title

        # extract the text of the first page
        content = input1.getPage(0).extractText()

        # wrap the text in a file-like buffer and print its first two lines
        buf = StringIO.StringIO(content)
        print buf.readline().rstrip()
        print buf.readline().rstrip()

except Exception:
    print ",",

The directory containing my code has this "HMM.pdf" file, and the code runs fine on Python 2.7.
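The same idea can be wrapped in a small reusable helper. This is only a sketch: the function names `first_lines` and `first_lines_of_pdf` are made up for illustration, blank lines are skipped (an assumption about what "first two lines" should mean), and it relies on the pre-3.0 PyPDF2 names `PdfFileReader`, `getPage` and `extractText` used above. Keep in mind that `extractText` does not always preserve line breaks the way they appear on the page, so the result depends on the PDF.

```python
def first_lines(text, count=2):
    """Return up to `count` non-blank lines from `text`, stripped of whitespace."""
    lines = []
    for raw in text.splitlines():
        if len(lines) >= count:
            break
        stripped = raw.strip()
        if stripped:  # skip blank lines (an assumption, see the note above)
            lines.append(stripped)
    return lines


def first_lines_of_pdf(path, count=2):
    """Hypothetical helper: first `count` lines of text on page 1 of a PDF."""
    # Imported here so first_lines() stays usable without PyPDF2 installed.
    # Assumes a PyPDF2 release older than 3.0, where these names exist.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        return first_lines(reader.getPage(0).extractText(), count)
```

Calling `first_lines_of_pdf("HMM.pdf")` would then return a list with at most two strings, which can be used as a fallback when `getDocumentInfo().title` is empty.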

Tags: python, pypdf, python-2.7