Convert several files with pdfminer
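I found code online that converts a PDF file to a text file using the pdfminer module in Python. I tried to extend it to handle several PDF files stored in a directory, but my version raises an error.
My code so far: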
import nltk
import re
import glob
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    with open('D:\Reports\*.txt', 'w') as pdf_file:
        pdf_file.write(text)
    return text

directory = glob.glob('D:\Reports\*.pdf')
for myfiles in directory:
    convert(myfiles)
Error message:
Traceback (most recent call last):
  File "F:/Text mining/pdfminer for several files", line 40, in <module>
    convert(myfiles)
  File "F:/Text mining/pdfminer for several files", line 32, in convert
    with open('D:\Reports\*.txt', 'w') as pdf_file:
IOError: [Errno 22] invalid mode ('w') or filename: 'D:\Reports\*.txt'
Maybe you should change:
    with open('D:\Reports\*.txt', 'w') as pdf_file:
        pdf_file.write(text)
to
    with open(fname, 'w') as pdf_file:
        pdf_file.write(text)
I don't have Python 2.7-3.4 available on my machine to verify this, though.
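The error comes from trying to write the contents of the text variable to a file literally named 'D:\Reports\*.txt'. The wildcard character * is not allowed in a file name.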
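To illustrate the difference, here is a minimal sketch showing that glob.glob() is the piece that expands a wildcard into concrete paths, whereas open() takes its argument literally. The scratch directory and the file names a.pdf and b.pdf are made up for the demonstration; they are not from the question.

```python
import glob
import os
import tempfile

# Create a scratch directory holding two empty placeholder "PDF" files.
# (The names are hypothetical; any *.pdf files would do.)
workdir = tempfile.mkdtemp()
for name in ('a.pdf', 'b.pdf'):
    open(os.path.join(workdir, name), 'wb').close()

# glob.glob() expands the pattern into a list of real, existing paths.
matches = sorted(glob.glob(os.path.join(workdir, '*.pdf')))
names = [os.path.basename(path) for path in matches]
print(names)  # ['a.pdf', 'b.pdf']

# open() performs no such expansion: a '*' in its argument is treated as
# part of the file name, which Windows rejects as an invalid character.
```

Each element of the list returned by glob.glob() is therefore a usable file name, while the pattern itself is not.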
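If you want to save each result as a text file with the same name as the PDF, you can replace your write block with:
    outfile = os.path.splitext(os.path.abspath(fname))[0] + '.txt'
    with open(outfile, 'wb') as pdf_file:
        pdf_file.write(text)
Don't forget to import os if you want to handle paths in an OS-agnostic way.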
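Put together, the batch loop could look roughly like the sketch below. It is only an outline under a few assumptions: output_path_for and convert_all are hypothetical helper names, and convert_to_text stands in for whatever function returns the extracted text of one PDF (for example, the convert function from the question with its write block removed). The sketch opens the output in text mode, which assumes the converter returns a text string.

```python
import glob
import os


def output_path_for(pdf_path):
    """Return the path of a .txt file next to pdf_path (hypothetical helper)."""
    base, _ext = os.path.splitext(os.path.abspath(pdf_path))
    return base + '.txt'


def convert_all(pattern, convert_to_text):
    """Write one .txt file per PDF matching pattern and return the paths written.

    convert_to_text is an assumed callable that takes a PDF path and
    returns its text; it is not part of pdfminer itself.
    """
    written = []
    for pdf_path in glob.glob(pattern):
        text = convert_to_text(pdf_path)
        out_path = output_path_for(pdf_path)
        # Each output name is derived from the input name, so no wildcard
        # ever reaches open().
        with open(out_path, 'w') as out_file:
            out_file.write(text)
        written.append(out_path)
    return written
```

A call might then look like convert_all(r'D:\Reports\*.pdf', convert), where the raw string keeps the backslashes from being read as escape sequences.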