Keep layout of extracted text in pdfminer.six (Python)
I want to extract the text of this PDF: https://github.com/pdfminer/pdfminer.six/files/1887670/Wochenkarte-KW-15-Neu.pdf
When I extract the text with this code:
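from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    resource_manager = PDFResourceManager()
    device = None
    try:
        with StringIO() as string_writer, open(path, 'rb') as pdf_file:
            # Default LAParams() decides how lines are grouped into text boxes.
            device = TextConverter(resource_manager, string_writer, codec='utf-8', laparams=LAParams())
            interpreter = PDFPageInterpreter(resource_manager, device)
            # Only the first page is processed.
            for page in PDFPage.get_pages(pdf_file, maxpages=1):
                interpreter.process_page(page)
            pdf_text = string_writer.getvalue()
    finally:
        if device:
            device.close()
    return pdf_text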
The extracted text does not match the layout of the PDF.
Current result:
Montag 09.04.2018
Menü 1
Kl. Salat
Menü 2
Kl. Salat
Seelachs-Spinat-Türmchen mit Spinat-
Masalla-Sauce und Reis
Currywurst mit Pommes
Expected result:
Montag 09.04.2018
Menü 1
Kl. Salat Seelachs-Spinat-Türmchen mit Spinat-Masalla-Sauce und Reis
Menü 2
Kl. Salat Currywurst mit Pommes
What am I doing wrong, or what am I missing?
The key was to pass a different line_margin to LAParams:
LAParams(line_margin=0.1)
My line now looks like this:
device = TextConverter(resource_manager, string_writer, codec='utf-8', laparams=LAParams(line_margin=0.1))
Thanks to Tim
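To see why a smaller line_margin changes the grouping, here is a rough sketch of the idea in plain Python. It is a simplified model, not pdfminer.six's actual layout algorithm: it ignores horizontal position and only merges a line into the previous box when the vertical gap is at most line_margin times the line height (pdfminer.six documents line_margin as relative to line height, with a default of 0.5). The Line class, the group_lines function, and the coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float     # y of the top edge (larger means higher on the page)
    bottom: float  # y of the bottom edge

    @property
    def height(self) -> float:
        return self.top - self.bottom


def group_lines(lines, line_margin):
    """Group vertically adjacent lines into boxes (simplified model)."""
    boxes = []
    # Walk from the top of the page downward.
    for line in sorted(lines, key=lambda item: -item.top):
        if boxes:
            previous = boxes[-1][-1]
            gap = previous.bottom - line.top
            # Merge only when the gap is small relative to the line height.
            if gap <= line_margin * line.height:
                boxes[-1].append(line)
                continue
        boxes.append([line])
    return boxes


# Hypothetical left column: four lines, each 10 units tall, 3 units apart.
lines = [
    Line("Menü 1", 100, 90),
    Line("Kl. Salat", 87, 77),
    Line("Menü 2", 74, 64),
    Line("Kl. Salat", 61, 51),
]

print(len(group_lines(lines, 0.5)))  # 1: all four lines end up in one box
print(len(group_lines(lines, 0.1)))  # 4: every line stays in its own box
```

In this model, the default margin merges the whole left column into one block, which would then be emitted before the right-hand column. With 0.1 each line stays separate, so lines can be ordered by their position on the page instead.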