Extract text from PDF File using Python with PyPDF2

I want to extract the text from a given PDF.

The code I am using is:

# Legacy PyPDF2 (< 3.0) API names
from PyPDF2 import PdfFileReader


def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        number_of_pages = pdf.getNumPages()
        # Print the extracted text of every page in order
        for page_number in range(number_of_pages):
            page = pdf.getPage(page_number)
            page_content = page.extractText()
            print(page_content)
 

if __name__ == '__main__':
    path = 'test.pdf'
    extract_information(path)

But when I run the code above, I get the following output:

PS E:\Omkar\Coding\Python\pdfSearch> python .\scrape.py
 !"#$%&!'()*+&,$ !")-!+)-. !"#$%$&'$%%()%*)(+(+$,-,.-+/ 0 1234#5$&3-6#3#1!4#5-[=12=]#5"#3:;;#<$=-$%(+,(>(?/0&1(+(3)-4!+&)(@15#123"$ A8B-C9D;E:F0G$;@HFI%*,JJ>*%J/H F=-D2K#3B#=->.J*EKK4=- 1#L#342L#$M!152!K$M!1#$M&1NO?JP%%$D9QQ9;IR$SDTC$*E
;FM:0@HC$:FDDG$HU$%%/%?
V>%?W*%JPJ?++ A&3#=%(+,(>(?X:ED@@G[=12=]FM:E9D%(+,(>(?X:ED@@G[=12=]FM:E9D%(+,(>(?X:ED@@G[=12=]FM:E9D%(+,(>(?X:ED@@G[=12=]FM:E9D%(+,(>(?X:ED@@G[=12=]FM:E9D%(+,(>(?X:ED@@G[=12=]FM:E9DQ!Y=V?,,W>J/P/*,/H!Z#-X:ED@@G[=12=]FM:E9DR#Y-[=12=]C@S-$+*)%+)%..* A&3#-$*/>,,J(?*>F3$M!1#$@'-X:ED@@G[=12=]FM:E9D$
E551#BB-(*?$M9CE;:[;RI$ET9$%S
 !42#34$FC-$,.>>J>?C2!"$M&5#B-M&N8$;#N&14\(+O?(?\>%O.
C!4#$M&]]#K4#5-I2Z#$M&]]#K4#5-
Q!B423"-$^_I2Z#5$[123#$M&]]#K42&3-$^$$$_
H&3$'!B423"-$^`$_T&]aZ#-
M!]]$;#Ba]4B-$^$$$_M&ZZ#34B- !42#34-F3Ba1!3K#-M]2#34-0#52K!25-0#52K!1#-;!2]1&!5[=12=]M;-
F3Ba1#5$H!Z#-F3Ba1!3K#$ ]!3-9ZN]&8#1)61&aN$H!Z#- &]2K8=-61&aN) ]!3=-%()%+)(+%%-+>$!Z
`X:ED@@G[=12=]FM:E9D$
;#]!42&3BA2N-R#]'bXJJ>(,H$$!+&1(+(3)-4!+&)(2(*6-!(,1(3)-4!+&)(%&!'()*&*$/)71*891,&41((3)-4!+&)(;VRRIW6US6;UDSMVS]&&5$Ma]4W
:-71-17$;1*+*
M9CE;:[;RI$HU$%%,%J
09I;@ D[R[=12=]MC$^/>>%(_$ O@O$S@`$%.JJ$H9c$U@;X$HU$
%+%%J%.JJ
OM@TFC%.$RE;RPM@T($$`$$(+(/ A8O$H!Z#-C9D;E:F0G$;@HFI A8B2K2!3$$R2"3!4a1#-

I think this must have something to do with the encoding used in the PDF, but I can't figure out what.

link to the pdf used

Thanks in advance.
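A side note on the API used in the question: PdfFileReader, getNumPages, getPage and extractText are the legacy PyPDF2 names, and they were removed in PyPDF2 3.0. The sketch below shows the same loop with the current names from pypdf, the successor package. It assumes pypdf is installed; the function name extract_pages is made up for illustration. Newer releases may or may not decode this particular PDF correctly, since that depends on how the fonts in the file map glyphs to characters.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return the extracted text of each page as a list of strings."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the missing-file check works even without pypdf.
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(str(path))
    # extract_text() can return an empty string for pages without a text layer.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling extract_pages('test.pdf') returns one string per page, which can be printed or joined as needed.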

I use pdfminer to extract text from PDFs. You can refer to the sample code below.

# pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert the content of a PDF file to text.

    :param path: path to the PDF file
    :return: the extracted text as a single string
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('test.pdf')) 

For more information about this library, you can refer to the link below.

PDFminer
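For simple cases, the pdfminer code above can probably be shortened: pdfminer.six also ships a high-level helper, pdfminer.high_level.extract_text, that wraps the resource manager, converter and interpreter setup. A minimal sketch, assuming default layout parameters are good enough for your file (the wrapper name pdf_to_text is only illustrative):

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract all text from a PDF using pdfminer.six's high-level API."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the missing-file check works even without pdfminer.six.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    # Uses default LAParams; returns the text of all pages as one string.
    return extract_text(str(path))
```

With this, print(pdf_to_text('test.pdf')) should behave much like the longer version, which remains useful when you need control over the converter or want to process pages one at a time.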

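If the PDF has no usable text layer (for example, a scanned document, or fonts whose characters cannot be mapped back to text), you need to use OCR. In my opinion the best OCR engine is Tesseract OCR, originally developed at HP and later sponsored by Google. Install pytesseract and run it on the PDF pages after rendering them as images. I strongly recommend combining it with OpenCV so that OCR is applied only to the regions that contain text.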

https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
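A rough sketch of the OCR route, under some assumptions: pdf2image (which needs Poppler installed) is one possible way to render the pages to images, the Tesseract binary is available on the PATH, and the dpi and lang defaults are guesses you would tune. The function name ocr_pdf is hypothetical, and the OpenCV step for isolating text regions is left out to keep it short.

```python
from pathlib import Path


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """OCR every page of a PDF and return the recognized text as one string."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the missing-file check works without the OCR stack.
    from pdf2image import convert_from_path  # pip install pdf2image
    import pytesseract  # pip install pytesseract

    # Render each page to a PIL image; higher dpi is slower but more accurate.
    pages = convert_from_path(str(path), dpi=dpi)

    # Run Tesseract on each page image and join the page texts.
    return "\n".join(
        pytesseract.image_to_string(image, lang=lang) for image in pages
    )
```

OCR is slower and can misread characters, so it is usually worth trying a text-based extractor first and falling back to something like ocr_pdf('test.pdf') only when the extracted text is unreadable.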