No space between words while reading and extracting the text from a pdf file in python?

Question

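Hello community members,

I want to extract all the text from an e-book with the .pdf file extension. I came to know that Python has a package, PyPDF2, that does what is needed. I have tried it and I am able to extract text, but the spacing between the extracted words is wrong: sometimes 2-3 words come out merged into a single word.

Further, I want to extract the text from page 3 onwards, since the initial pages deal with the cover and preface. I also don't want to include the last 5 pages, since they contain the glossary and index.

Is there any other way to read a .pdf binary file that is not encrypted?

The code snippet I have tried so far is as follows.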

import PyPDF2

def Read():
    pdfFileObj = open('book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # discerning the number of pages will allow us to parse through all the pages
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText().split()
        print(text)

Read()

Answer 1

Here is a possible solution:

import PyPDF2

def Read(startPage, endPage):
    # Uses the legacy PyPDF2 names (PdfFileReader, getPage, extractText),
    # which were removed in PyPDF2 3.0.
    global text
    rawText = ""
    with open('myTest2.pdf', 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # Page indices are zero-based and endPage is inclusive.
        while startPage <= endPage:
            pageObj = pdfReader.getPage(startPage)
            rawText += pageObj.extractText()
            startPage += 1
    # Drop the newline characters in the extracted text, then split into words.
    cleanText = rawText.replace('\n', '')
    text = cleanText.split()
    print(text)

Read(0, 0)

Read() parameters --> Read(first page to read, last page to read)

Note: page numbering starts at 0, not 1 (as in an array, for example).
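
To cover the "start at page 3, skip the last 5 pages" part of the question, the zero-based indices can be computed rather than hard-coded. The following is only a sketch: body_page_range and read_body_words are hypothetical helper names, the defaults of 2 and 5 come from the question, and read_body_words assumes the pypdf package (the successor to PyPDF2) is installed. Whether a newer extractor also improves the spacing depends on how the PDF was produced, so treat that as something to try, not a guarantee.

```python
def body_page_range(num_pages, skip_front=2, skip_back=5):
    """Return (first, last) zero-based, inclusive page indices of the body.

    Hypothetical helper: skip_front=2 means "start at page 3",
    skip_back=5 means "leave out the last 5 pages".
    """
    first = skip_front
    last = num_pages - skip_back - 1
    if first > last:
        raise ValueError("no pages left after skipping front and back matter")
    return first, last


def read_body_words(path, skip_front=2, skip_back=5):
    """Extract the words of the body pages using pypdf (assumed installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(path)
    first, last = body_page_range(len(reader.pages), skip_front, skip_back)
    words = []
    for index in range(first, last + 1):
        # Guard against pages that yield no text (e.g. image-only pages).
        page_text = reader.pages[index].extract_text() or ""
        words.extend(page_text.split())
    return words


# For a 100-page book: pages 3..95, i.e. indices 2..94.
print(body_page_range(100))  # (2, 94)
```

The pair returned by body_page_range can also be passed straight to the Read() function above, e.g. Read(*body_page_range(num_pages)), once the page count is known.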

pdf

python-3.x

pypdf2