How to avoid the incorrect-password error on PDF files when using PDFMiner
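I want to collect all PDF files from my computer and extract the text from each one. The two functions I currently have do that, but some PDF files give me this error: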
raise PDFPasswordIncorrect
pdfminer.pdfdocument.PDFPasswordIncorrect
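I raised the error in the function that opens and reads the PDF files. That seems to work for ignoring the error, but now it ignores all PDF files, including the good ones that were not a problem before.
How can I make it ignore only the PDF files that give me this error, rather than every PDF?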
def pdfparser(x):
    try:
        raise PDFPasswordIncorrect(pdfminer.pdfdocument.PDFPasswordIncorrect)
        fp = open(x, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
    except (RuntimeError, TypeError, NameError,ValueError,IOError,IndexError,PermissionError):
        print("Error processing {}".format(name))
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data = retstr.getvalue()
    return(data)
def pdfs(files):
    for name in files:
        try:
            IP_list = (pdfparser(name))
            keyword = re.findall(inp,IP_list)
            file_dict['keyword'].append(keyword)
            file_dict['name'].append(name.name[0:])
            file_dict['created'].append(time.ctime(name.stat().st_ctime))
            file_dict['modified'].append(time.ctime(name.stat().st_mtime))
            file_dict['path'].append(name)
            file_dict["content"].append(IP_list)
        except (RuntimeError, TypeError, NameError,ValueError,IOError,IndexError,PermissionError):
            print("Error processing {}".format(name))
    #print(file_dict)
    return(file_dict)
pdfs(files)
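Why would you manually raise the error that occurs when you open a password-protected PDF without supplying the correct password?
Your code raises this error every time!
Instead, you need to catch the error when it occurs and skip that file. See the corrected code: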
def pdfparser(x):
    try:
        # try to open your pdf here - do not raise the error yourself!
        # if it happens, catch and handle it as well
        pass  # placeholder so this skeleton is valid Python
    except PDFPasswordIncorrect as e:  # catch PDFPasswordIncorrect with all other errors
        print("Error processing {}: {}".format(x, e))  # x, not name: name is undefined here
        # no sense in doing anything if you got an error until here
        return None
    # do something with your pdf and collect data
    data = []
    return(data)
def pdfs(files):
    for name in files:
        try:
            IP_list = pdfparser(name)
            if IP_list is None:  # unable to read for whatever reasons
                continue  # process next file
            # do stuff with your data if you got some
            # most of these errors are already handled inside pdfparser
        except (RuntimeError, TypeError, NameError,ValueError,
                IOError,IndexError,PermissionError):
            print("Error processing {}".format(name))
    return(file_dict)
pdfs(files)
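The catch-and-skip pattern above can be sketched in a form that runs on its own. This is only an illustration under stated assumptions. The PDFPasswordIncorrect class here is a local stand-in for pdfminer.pdfdocument.PDFPasswordIncorrect, so the sketch runs without pdfminer installed. safe_extract, collect_texts and fake_extract are hypothetical names that do not appear in the question. In real code you would import the exception from pdfminer and pass in your own extraction function.

```python
from typing import Callable, Dict, Iterable, Optional


class PDFPasswordIncorrect(Exception):
    """Stand-in for pdfminer.pdfdocument.PDFPasswordIncorrect (assumption:
    defined locally only so this sketch runs without pdfminer)."""


def safe_extract(path: str, extract: Callable[[str], str]) -> Optional[str]:
    """Return the extracted text, or None if the file has to be skipped."""
    try:
        return extract(path)
    except (PDFPasswordIncorrect, OSError) as e:
        # Only the errors we expect are caught; anything else still surfaces.
        print("Error processing {}: {!r}".format(path, e))
        return None


def collect_texts(paths: Iterable[str],
                  extract: Callable[[str], str]) -> Dict[str, str]:
    """Map each readable path to its text; skipped files are left out."""
    texts = {}
    for path in paths:
        text = safe_extract(path, extract)
        if text is None:
            continue  # skip this file, keep processing the rest
        texts[path] = text
    return texts


def fake_extract(path: str) -> str:
    """Hypothetical extractor: pretends 'locked*' files need a password."""
    if path.startswith("locked"):
        raise PDFPasswordIncorrect(path)
    return "text of " + path


result = collect_texts(["a.pdf", "locked.pdf", "b.pdf"], fake_extract)
print(sorted(result))
```

Only the locked file is skipped. The other two are still processed, which is the behavior the question asks for.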
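The second try/except, the one in def pdfs(files):, can be narrowed: all file-related errors occur in def pdfparser(x): and are handled there. The rest of your code is incomplete and references things I don't know about:
file_dict
inp
name # used as filehandle for .stat() but is a string etc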
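One reading that would make name.name and name.stat() valid is that files holds pathlib.Path objects rather than strings. The question does not confirm this, so treat the sketch below as an assumption. It also assumes file_dict is a dict of lists (a defaultdict(list) here), and record_metadata is a hypothetical helper name.

```python
import tempfile
import time
from collections import defaultdict
from pathlib import Path


def record_metadata(path: Path, content: str, file_dict) -> None:
    """Append one file's metadata and text to file_dict (a dict of lists)."""
    stat = path.stat()  # works because path is a Path, not a plain string
    file_dict["name"].append(path.name)
    file_dict["created"].append(time.ctime(stat.st_ctime))
    file_dict["modified"].append(time.ctime(stat.st_mtime))
    file_dict["path"].append(path)
    file_dict["content"].append(content)


file_dict = defaultdict(list)
with tempfile.TemporaryDirectory() as tmp:
    # A throwaway file stands in for a real PDF; only its metadata is read.
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    record_metadata(sample, "placeholder text", file_dict)

print(file_dict["name"])
```

If files comes from something like Path.rglob("*.pdf"), each entry already has .name and .stat(). With plain strings, wrapping each one in Path(...) first would be needed.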