What is the correct way to extract text using PyPDF2?

Question

I am trying to extract text from a PDF file. I am using the following code for this task:

import PyPDF2

def get_pdf_text(file):
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    for pages in range(0,numpages):
        currpage = pdffile.getPage(pages)
        content = currpage.extractText().encode('UTF-8')
    return content

However, the output I get is very different from the source file:

b'Inheritance is a basic concept of Object\n-\nOriented Programming where\n \nthe basic idea is to create new classes that add extra detail to\n \nexisting classes. This is done by allowing the new classes to reuse\n \nthe methods and variables of the existing classes and new methods and\n \nclasses are added to specialise the new class. Inheritance models the\n \n\n-\nkind\n-\n\nbjects), for example,\n \npostgraduates and undergraduates are both kinds of student. This kind\n \n\n \n\n \n\n \n\n \n\n \n\n \nclass exten\n\n \n \nInheritance can occur on several layers, where if visualised would\n \ndisplay a larger tree structure. For example, we could further extend\n \n\n \n\n\n \n\n \n\n \n\n \n'

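Not only do multiple \n sequences appear in unexpected places, but some of the content also seems to be missing. I can't find a way to fix this. Thanks in advance for your help.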

Answer 1

The problem is in your PDF file. I copied your text into another PDF file, and extraction now works.
Add str() before returning.
Use print(pdf_text).

The modified code is as follows:

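import PyPDF2

def get_pdf_text(file):
    # Legacy PyPDF2 API; PyPDF2 3.x renamed these to PdfReader, .pages and extract_text().
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    content = ''
    for pages in range(0, numpages):
        currpage = pdffile.getPage(pages)
        # Append each page; plain assignment here would keep only the last page.
        content += str(currpage.extractText())
    return content

print(get_pdf_text('Untitled.pdf'))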

Output:

'Inheritance is a basic concept of Object-Oriented Programming where the 
basic idea is to create new classes that add extra detail to existing classes. 
This is done by allowing the new classes to reuse the methods and 
variables of the existing classes and new methods and classes are added to 
specialise the new class. Inheritance models the Òis-kind-ofÓ relationship 
between entities (or objects), for example, postgraduates and 
undergraduates are both kinds of student. This kind of relationship can be 
visualised as a tree structure, where ÔstudentÕ would be the more general 
root node and both ÔpostgraduateÕ and ÔundergraduateÕ would be more 
specialised extensions of the ÔstudentÕ node (or the child nodes). In this 
relationship ÔstudentÕ would be 
known as the superclass or parent class whereas, ÔpostgraduateÕ would be 
known as the subclass or child class because the ÔpostgraduateÕ class 
extends the ÔstudentÕ class. 
Inheritance can occur on several layers, where if visualised would display 
a larger tree structure. For example, we could further extend the 
ÔpostgraduateÕ node by adding two extra extended classes to it called, 
ÔMSc StudentÕ and ÔPhD StudentÕ as both these types of student are kinds 
of postgraduate student. This would mean that both the ÔMSc StudentÕ and 
ÔPhD StudentÕ classes would inherit methods and variables from both the 
ÔpostgraduateÕ and Ôstudent classesÕ. '
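
If re-creating the PDF is not an option, the stray "\n \n" runs and the hyphens split onto their own lines (as in "Object\n-\nOriented" in the question's output) can be tidied up after extraction. The following is only a minimal sketch under that assumption: the helper names normalize_extracted_text and join_pages are hypothetical and not part of PyPDF2, the function flattens paragraph breaks along with the unwanted ones, and it cannot bring back text that the extractor dropped.

```python
import re


def normalize_extracted_text(raw):
    """Collapse the stray line breaks that text extraction can leave behind.

    Hypothetical helper, not part of PyPDF2.
    """
    # Accept bytes (as produced by .encode('UTF-8') in the question) or str.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    # Rejoin hyphens that ended up on a line of their own: "Object\n-\nOriented".
    text = re.sub(r"\s*\n-\n\s*", "-", raw)
    # Treat every remaining whitespace run (including "\n \n") as one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def join_pages(page_texts):
    """Normalize each page's text and join the non-empty pages with a blank line."""
    cleaned = (normalize_extracted_text(t) for t in page_texts)
    return "\n\n".join(c for c in cleaned if c)


if __name__ == "__main__":
    sample = (b"Inheritance is a basic concept of Object\n-\nOriented "
              b"Programming where\n \nthe basic idea")
    print(normalize_extracted_text(sample))
```

In get_pdf_text, the per-page strings returned by extractText() could be collected in a list and passed to join_pages instead of being concatenated directly.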

python

pypdf2