PDFminer: get the font size of headers for each page (iteration)
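I'm new to Python and PDFminer, and this is a bit complicated for me. What I want to achieve is to extract the title of each page from a PDF file or slide deck.
My approach is to get a list of the text lines and their font sizes for each page, then pick the entry with the largest font size as the slide title, since titles are usually written in a larger font.
Here is what I have done so far:
Suppose I want to get the title of page 8 from this PDF file. File sample
This is what the content of page 8 looks like:
This is the code that gets the font size of each line for all pages: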
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar,LTLine,LAParams
    import os
    path=r'cov.pdf'

    Extract_Data=[]

    for page_layout in extract_pages(path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            Font_size=character.size
                Extract_Data.append([Font_size,(element.get_text())])
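The resulting list, Extract_Data, covers all pages of the PDF document. My question is: how can I get this list for each page of the document separately (by iteration)?
Expected output for page 8 only, and likewise for the other pages. If I then want to pick the page title, it would be the item (line) with the highest font size value: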
[[32.039999999999964, 'Pandemic declaration \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0,
'• On March 11, 2020, the World Health Organization \n(WHO) characterized COVID-19 as a pandemic. \n \n• It has caused severe illness and death. It features \n \nsustained person-to-person spread worldwide. \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, '• It poses an especially high risk for the elderly (60 or \n \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0,
'older), people with preexisting health conditions such \nas high blood pressure, heart disease, lung disease, \n \ndiabetes, autoimmune disorders, and certain workers. \n \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[24.0, ' \n'],
[14.04, '8 \n']]
Full disclosure: I am one of the maintainers of pdfminer.six.
A Pythonic way to do this is as follows.
    import os

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar


    def get_font_sizes(paragraph: LTTextContainer):
        """Get the font sizes for every LTChar element in this LTTextContainer"""
        return [
            char.size
            for line in paragraph
            for char in line
            if isinstance(char, LTChar)
        ]


    def list_sized_paragraphs(page):
        """List all the paragraphs and their maximum font size on this page"""
        return [
            (max(get_font_sizes(paragraph)), paragraph.get_text())
            for paragraph in page
            if isinstance(paragraph, LTTextContainer)
        ]


    file_path = '~/Downloads/covid_19_training_tool_v3_01.05.2021_508.pdf'
    for page in extract_pages(os.path.expanduser(file_path)):
        _, text = max(list_sized_paragraphs(page))
        print('---')
        print(text.strip())
For page 8 this prints:
Pandemic declaration
Note: this does not work for every page, because warnings or notes are sometimes set in a larger font than the header.
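To keep a separate list for each page, as the question asks, the same idea can be arranged so that every page number maps to its own list. What follows is only a sketch under assumptions, not part of the original answer: the names `pick_title` and `sized_paragraphs_by_page` are made up for illustration, pages are numbered from 1 in the order `extract_pages` yields them, and whitespace-only paragraphs are skipped when choosing the title. It does not address warnings or notes that are larger than the header.

```python
from typing import Dict, List, Optional, Tuple

# (maximum font size in the paragraph, paragraph text)
SizedParagraph = Tuple[float, str]


def pick_title(sized_paragraphs: List[SizedParagraph]) -> Optional[str]:
    """Return the stripped text of the paragraph with the largest font size.

    Whitespace-only paragraphs are ignored; None is returned if nothing is left.
    """
    candidates = [
        (size, text.strip())
        for size, text in sized_paragraphs
        if text.strip()
    ]
    if not candidates:
        return None
    # Compare on size only; on ties max() keeps the first paragraph in list order.
    _, title = max(candidates, key=lambda item: item[0])
    return title


def sized_paragraphs_by_page(path: str) -> Dict[int, List[SizedParagraph]]:
    """Map 1-based page numbers to the (size, text) list of that page."""
    # Imported here so pick_title() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    result: Dict[int, List[SizedParagraph]] = {}
    for page_number, page in enumerate(extract_pages(path), start=1):
        paragraphs: List[SizedParagraph] = []
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            sizes = [
                char.size
                for line in element
                for char in line
                if isinstance(char, LTChar)
            ]
            if sizes:  # skip containers that hold no characters
                paragraphs.append((max(sizes), element.get_text()))
        result[page_number] = paragraphs
    return result


if __name__ == "__main__":
    # A few entries taken from the page 8 list shown in the question.
    page_8 = [
        (32.039999999999964, "Pandemic declaration \n"),
        (24.0, " \n"),
        (14.04, "8 \n"),
    ]
    print(pick_title(page_8))  # Pandemic declaration
```

With a real file, `sized_paragraphs_by_page(path)[8]` would give the list for page 8 only, and `pick_title` applied to it returns `None` for a page without text instead of raising `ValueError` the way a bare `max()` on an empty list does.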