How do I extract all of the text from a PDF using indexing

Question

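I'm new to Python and to coding in general. I'm trying to write a program that will OCR a directory of PDFs and then extract the text so that I can pick out specific things later. However, I can't get pdfPlumber to extract all of the text from all of the pages. You can index from the start to the end, but if you don't know where the end is, it breaks because the index is out of range.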

import ocrmypdf
import os
import requests
import pdfplumber
import re
import logging
import sys
import PyPDF2

## test folder C:\Users\adams\OneDrive\Desktop\PDF

user_direc = input("Enter the path of your files: ") 

#walks the path and prints out each PDF in the 
#OCRs the documents and skips any OCR'd pages.


for dir_name, subdirs, file_list in os.walk(user_direc):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[0--1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
result = ocrmypdf.ocr(filename, filename, skip_text=True, deskew = True, optimize = 1) 
logging.info(result)

#the next step is to extract the text from each individual document and print

directory = os.fsencode(user_direc)
    
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)

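As is, this only extracts the text from the first page of each PDF. I want to extract all of the text from each PDF, but pdfPlumber breaks if my index is too large, and I don't know how many pages each PDF has. I tried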

page = pdf.pages[0--1]

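but this breaks as well. I haven't been able to find a workaround with PyPDF2 either. I apologize if this code is sloppy or unreadable. I tried to add comments to explain what I'm doing.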

Answer 1

The pdfplumber git page says that pdfplumber.open returns an instance of the pdfplumber.PDF class.

That instance has a pages property, which is a list of pdfplumber.Page instances, one for each page loaded from your PDF. Looking at your code, if you do:

total_pages = len(pdf.pages)

you should get the total number of pages in the currently loaded PDF.
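
As a minimal sketch of how that count can bound an index-based loop: extract_by_index is a hypothetical helper name, and it assumes only that pages supports len() and indexing and that each page has an extract_text() method, as pdfplumber pages do. FakePage is a stand-in so the example runs without a real PDF.

```python
def extract_by_index(pages):
    """Return the text of every page, visiting the pages by index."""
    texts = []
    total_pages = len(pages)        # number of pages actually available
    for i in range(total_pages):    # i stops at total_pages - 1, so no IndexError
        # extract_text() may return None for a page with no text
        texts.append(pages[i].extract_text() or "")
    return texts


class FakePage:
    """Stand-in for a pdfplumber page, used only for this illustration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("first"), FakePage(None), FakePage("third")]
print(extract_by_index(pages))
```

With pdfplumber, this would be called as extract_by_index(pdf.pages) inside the with pdfplumber.open(...) block.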

To combine all of a PDF's text into one giant text string, you could use a 'for in' loop. Try changing your existing code:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)

to:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        all_text = '' # new line
        with pdfplumber.open(file) as pdf:
            # page = pdf.pages[0] - comment out or remove line
            # text = page.extract_text() - comment out or remove line
            for pdf_page in pdf.pages:
                # extract_text() may return None for a page with no text
                single_page_text = pdf_page.extract_text() or ''
                print(single_page_text)
                # separate each page's text with newline
                all_text = all_text + '\n' + single_page_text
            print(all_text)
            # print(text) - comment out or remove line

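Rather than accessing individual pages with an index value such as pdf.pages[0], use for pdf_page in pdf.pages. The loop stops after the last page without raising an exception, so you don't have to worry about using an out-of-range index.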
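
For reference, here is a small sketch of why the pdf.pages[0--1] attempt from the question fails. It uses a plain list as a stand-in for pdf.pages, which is an assumption made so the example runs without a PDF. Python parses 0--1 as 0 - (-1), which is 1, so the expression asks for the second page only. A slice or a for loop covers every page without needing the page count.

```python
pages = ["page one"]        # stand-in for the pages of a one-page PDF

index = 0--1                # parsed as 0 - (-1)
print(index)                # 1

try:
    pages[index]            # a one-page document has no index 1
except IndexError as exc:
    print("IndexError:", exc)

# Slices do not raise for out-of-range bounds; they stop at the end.
print(pages[0:100])         # ['page one']
print(pages[:])             # every element, whatever the length

# Iterating visits each element once and stops after the last one.
collected = [p for p in pages]
```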

Answer 2

If you get this error when trying the code above:

fp = open(path_or_fp, "rb") FileNotFoundError: [Errno 2] No such file or directory:

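This is because os.listdir() only gives you the file names, so you have to join each one with the directory. The names that os.listdir() returns are relative to the directory you listed. You need to rebuild the full path to open those files.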

To fix this error, try the following code:

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        fullpath = os.path.join(directory, filename)
        #print(fullpath)
        all_text = ""
        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                text = page.extract_text() or ''  # may be None for a page with no text
                #print(text)
                all_text += '\n' + text
        print(all_text)
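
Since the question walks subfolders with os.walk, the same path-joining idea can be sketched for a recursive search. find_pdfs is a hypothetical helper name, and the case-insensitive extension check is an assumption about what is wanted. The temporary folder exists only to make the example runnable.

```python
import os
import tempfile


def find_pdfs(root):
    """Return full paths of all .pdf files under root, including subfolders."""
    found = []
    for dir_name, _subdirs, file_list in os.walk(root):
        for filename in file_list:
            # splitext returns (stem, extension); [1] is the extension
            if os.path.splitext(filename)[1].lower() == ".pdf":
                # os.walk, like os.listdir, yields bare names, so join them
                found.append(os.path.join(dir_name, filename))
    return sorted(found)


# Build a throwaway folder tree to show the result.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.pdf", "notes.txt", os.path.join("sub", "b.PDF")):
        open(os.path.join(root, rel), "w").close()
    demo_paths = [os.path.relpath(p, root) for p in find_pdfs(root)]

print(demo_paths)
```

Each path returned by find_pdfs can be passed straight to pdfplumber.open().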
