What is the correct way to extract text using PyPDF2?
I am trying to extract text from a PDF file. I am using the following code for this task:
def get_pdf_text(file):
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    for pages in range(0,numpages):
        currpage = pdffile.getPage(pages)
        content = currpage.extractText().encode('UTF-8')
    return content
However, the output I get is very different from the source file:
b'Inheritance is a basic concept of Object\n-\nOriented Programming where\n \nthe basic idea is to create new classes that add extra detail to\n \nexisting classes.
This is done by allowing the new classes to reuse\n \nthe methods and variables of the existing classes and new methods and\n \nclasses are added to specialise the new class.
Inheritance models the\n \n\n-\nkind\n-\n\nbjects), for example,\n \npostgraduates and undergraduates are both kinds of student. This kind\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \nclass exten\n\n \n \nInheritance can occur on several layers, where if visualised would\n \ndisplay a larger tree structure. For example, we could further extend\n \n\n \n\n\n \n\n \n\n \n\n \n'
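Not only do multiple \n characters appear in unexpected places, but some of the content also seems to be missing.
I can't seem to find a way to fix this.
Thanks in advance for your help.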
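The problem is in your PDF file. I copied your text into another PDF file, and extraction now works.
Wrap the extracted text in str() before returning it.
Display the result with print(pdf_text).
The modified code is as follows: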
import PyPDF2

def get_pdf_text(file):
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    content = ''
    for pages in range(0, numpages):
        currpage = pdffile.getPage(pages)
        # Append each page; plain assignment would keep only the last page.
        content += str(currpage.extractText())
    return content

print(get_pdf_text('Untitled.pdf'))
Output:
'Inheritance is a basic concept of Object-Oriented Programming where the
basic idea is to create new classes that add extra detail to existing classes.
This is done by allowing the new classes to reuse the methods and
variables of the existing classes and new methods and classes are added to
specialise the new class. Inheritance models the Òis-kind-ofÓ relationship
between entities (or objects), for example, postgraduates and
undergraduates are both kinds of student. This kind of relationship can be
visualised as a tree structure, where ÔstudentÕ would be the more general
root node and both ÔpostgraduateÕ and ÔundergraduateÕ would be more
specialised extensions of the ÔstudentÕ node (or the child nodes). In this
relationship ÔstudentÕ would be
known as the superclass or parent class whereas, ÔpostgraduateÕ would be
known as the subclass or child class because the ÔpostgraduateÕ class
extends the ÔstudentÕ class.
Inheritance can occur on several layers, where if visualised would display
a larger tree structure. For example, we could further extend the
ÔpostgraduateÕ node by adding two extra extended classes to it called,
ÔMSc StudentÕ and ÔPhD StudentÕ as both these types of student are kinds
of postgraduate student. This would mean that both the ÔMSc StudentÕ and
ÔPhD StudentÕ classes would inherit methods and variables from both the
ÔpostgraduateÕ and Ôstudent classesÕ. '
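If regenerating the PDF is not an option, the stray line breaks from the question can also be tidied up after extraction. The following is only a sketch under a few assumptions: the helper name normalize_extracted_text is made up for illustration, the hyphen rule is a heuristic that assumes a "-" alone on a line belongs to a compound word, and get_pdf_text assumes PyPDF2 3.x, where PdfReader, reader.pages and extract_text() replace the older PdfFileReader, getPage() and extractText() names. It cannot recover content that the extractor dropped entirely.

```python
import re


def normalize_extracted_text(raw):
    """Collapse the stray line breaks that text extraction can produce."""
    # Heuristic (an assumption, not part of PyPDF2): a hyphen stranded on its
    # own line, as in "Object\n-\nOriented", is rejoined to its neighbours.
    text = re.sub(r"\s*\n-\n\s*", "-", raw)
    # Any remaining run of whitespace, including "\n \n", becomes one space.
    return re.sub(r"\s+", " ", text).strip()


def get_pdf_text(path):
    """Return the cleaned text of every page (assumes PyPDF2 3.x)."""
    # Imported lazily so the helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Guard against pages that yield no text at all.
    pages = [page.extract_text() or "" for page in reader.pages]
    return normalize_extracted_text("\n".join(pages))
```

For example, normalize_extracted_text("Object\n-\nOriented Programming where\n \nthe basic idea") returns "Object-Oriented Programming where the basic idea". Because every line break is flattened into a space, paragraph boundaries are lost as well, so this only suits cases where running text is all that is needed.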