Python pdfminer LAParams mixes up text output
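I have a PDF file and I want to parse its text with pdfminer. The problem is that with LAParams the extraction sometimes goes wrong and puts some parts of a line at the end. I cannot figure out why. My PDF looks like this:
The output looks like this:
My code is here, thanks in advance: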
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    # Default layout-analysis parameters
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    # An empty set means "process every page"
    pagenos = set()
    for page_number, page in enumerate(PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True)):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

# Raw string, so the backslashes in the Windows path are not read as escapes
print(convert_pdf_to_txt(r'C:\Users\Vagos\Desktop\europe.pdf'))
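Found the answer myself. LAParams() has a char_margin parameter with a default value of 2.0. The gaps in my document are apparently sometimes larger than that, which causes the problem. Replacing LAParams() with LAParams(char_margin = 20) solved the issue. For the other parameters, see also http://nullege.com/codes/search/pdfminer.layout.LAParams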
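To illustrate why a larger char_margin keeps a row together, here is a rough sketch of the idea. It is not pdfminer's code: the Char class, the group_into_runs function, and the coordinates are all made up for the illustration, and it assumes only the documented rule that two characters belong to the same line when the gap between them is smaller than char_margin times the character width.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """A hypothetical character box on a single baseline."""
    text: str
    x0: float  # left edge
    x1: float  # right edge

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def group_into_runs(chars, char_margin):
    """Group characters into runs, left to right (simplified).

    Two neighbours stay in the same run while the horizontal gap between
    them is smaller than char_margin times the wider character's width.
    """
    runs = []
    current = []
    for ch in sorted(chars, key=lambda c: c.x0):
        if current:
            prev = current[-1]
            gap = ch.x0 - prev.x1
            limit = char_margin * max(prev.width, ch.width)
            if gap >= limit:
                # Gap too wide: close the current run and start a new one
                runs.append("".join(c.text for c in current))
                current = []
        current.append(ch)
    if current:
        runs.append("".join(c.text for c in current))
    return runs


# Made-up row: a label, a wide gap, then a value (units are arbitrary)
row = [Char("A", 0, 10), Char("B", 10, 20), Char("1", 100, 110), Char("2", 110, 120)]

print(group_into_runs(row, char_margin=2.0))   # gap 80 >= 2.0 * 10, so the row splits
print(group_into_runs(row, char_margin=20.0))  # gap 80 < 20.0 * 10, so the row stays whole
```

Once a row is split into separate pieces like this, the pieces can be ordered independently of each other, which would explain parts of a line showing up at the end of the output.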