Using pdfminer in Python to extract information from a PDF file
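I ran into a problem while using pdfminer in Spyder to extract some information from a PDF file. Following the official pdfminer documentation, I first defined an extraction function: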
# Define a pdf-to-txt function
def pdftotxt(path, new_name):
    # Create a pdf parser
    parser = PDFParser(path)
    # Create an object storing information
    document = PDFDocument(parser)
    # Evaluate if extractable
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource management to restore resource
        resmag = PDFResourceManager()
        # Set a parameter for analysis
        laparams = LAParams()
        # Create a PDF object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag,laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyzing each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Assign LTPage of this page
            layout = device.get_result()
            for y in layout:
                if(isinstance(y,LTTextBoxHorizontal)):
                    with open("%s"%(new_name),'a',encoding="utf-8") as f:
                        f.write(y.get_text()+"\n")
# Get a PDF's directory to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")
But it returns an error message:
  File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")
  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))
TypeError: must be str, not bytes
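Can anyone help me resolve this error? I tried googling it, but there don't seem to be any reports of a similar problem with pdfminer. Thank you very much in advance.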
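Posting my comment as an answer so that this doesn't look like an unanswered question to people scrolling through:
Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. A file opened in text mode returns str when read, while pdfminer searches the data for bytes such as b'\r' and b'\n', which is what raises the TypeError.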
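To illustrate why the mode matters, here is a minimal sketch. The first half uses only the standard library: `rfind_newline` is a hypothetical helper that mimics the failing line from the traceback, and the temporary file is a stand-in, not a real PDF. The second half is one possible rewrite of the question's `pdftotxt`, assuming it is acceptable to pass a path string instead of an open file, to skip the `is_extractable` check for brevity, and to open the output file once rather than once per text box.

```python
import os
import tempfile


def rfind_newline(chunk):
    """Hypothetical helper mimicking the failing pdfminer line from the traceback."""
    return max(chunk.rfind(b"\r"), chunk.rfind(b"\n"))


# Write a few PDF-like ASCII bytes to a temporary file (a stand-in, not a real PDF).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\nstartxref\n0\n%%EOF\n")
    demo_path = tmp.name

# Text mode: read() returns str, so searching it for bytes raises TypeError.
with open(demo_path) as fp:
    text_chunk = fp.read()
try:
    rfind_newline(text_chunk)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode: read() returns bytes, so the same search succeeds.
with open(demo_path, "rb") as fp:
    binary_chunk = fp.read()
last_newline = rfind_newline(binary_chunk)

os.remove(demo_path)


def pdftotxt(pdf_path, new_name):
    """Sketch of the question's function with the PDF opened in binary mode."""
    # Imported here so the demonstration above runs even without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBoxHorizontal
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # 'rb' is the fix: the parser needs bytes, not str.
    with open(pdf_path, "rb") as fp, open(new_name, "a", encoding="utf-8") as out:
        document = PDFDocument(PDFParser(fp))
        resmag = PDFResourceManager()
        device = PDFPageAggregator(resmag, laparams=LAParams())
        interpreter = PDFPageInterpreter(resmag, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # device.get_result() returns the layout of the page just processed.
            for item in device.get_result():
                if isinstance(item, LTTextBoxHorizontal):
                    out.write(item.get_text() + "\n")
```

With pdfminer installed, the call from the question would become pdftotxt('/keep_2.pdf', "pdfminer.txt"). If you would rather keep the original signature, the only change needed is to open the file with 'rb' before passing it in.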