How to find the font size of every paragraph of a PDF file using Python code?
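I am currently working on a project in which I have to find the font size of every paragraph in a PDF file. I have tried various Python libraries such as fitz, PyPDF2, pdfrw, pdfminer, and pdfreader. All of them extract the text data, but I don't know how to get the font size of the paragraphs.
Thanks in advance. Your help is appreciated.
I have tried the following, but it does not give me the font size.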
import fitz  # PyMuPDF

filepath = '/home/user/Downloads/abc.pdf'
text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # get_text() returns plain text only; older PyMuPDF releases named it getText()
        text += page.get_text()
print(text)
I found a solution using pdfminer.
The Python code is below.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

path = r'/path/to/pdf'
Extract_Data = []

for page_layout in extract_pages(path):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            Font_size = None  # stays None if the text box contains no characters
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
            # Store the size of the last character seen together with the text of the box
            Extract_Data.append([Font_size, element.get_text()])
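The loop above keeps only the size of the last character it sees in each text box. Here is a minimal sketch of one alternative, assuming you first collect every `character.size` of a box into a list. It picks the most frequent size, rounded to absorb floating-point noise. The helper name `dominant_size` and the rounding precision are illustrative assumptions, not part of pdfminer.

```python
from collections import Counter


def dominant_size(sizes, ndigits=1):
    """Return the most frequent font size in `sizes`, or None if it is empty.

    Sizes are rounded to `ndigits` decimals first, because font sizes read
    from a PDF can differ by tiny floating-point amounts.
    """
    rounded = [round(size, ndigits) for size in sizes]
    if not rounded:
        return None
    # most_common(1) returns a list holding one (value, count) pair
    return Counter(rounded).most_common(1)[0][0]


# Hypothetical sizes collected from one text box: one large initial plus body text
sizes_in_box = [24.0, 11.999, 12.0, 12.001, 12.0]
print(dominant_size(sizes_in_box))
```

In the pdfminer loop, this would mean appending each `character.size` to a per-box list and storing `dominant_size(that_list)` instead of the last value.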
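A better approach is to use fitz (the import name of PyMuPDF) itself. Compared with pdfminer, this library is noticeably faster and cleaner at scraping font information. A sample code snippet is shown below.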
import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        page_dict = page.get_text("dict")  # renamed from `dict` to avoid shadowing the built-in
        for block in page_dict["blocks"]:
            if "lines" in block:  # image blocks have no "lines" key
                for line in block["lines"]:
                    for span in line["spans"]:
                        # only store font information of spans that contain the keyword (case-insensitive)
                        if keyword.lower() in span["text"].lower():
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    pdf.close()
    return results
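If you want to find the font information of every line, you can omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by understanding the structure of the dictionary output that we obtain by using get_text("dict"), as mentioned in the documentation.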
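Since the question asks for one font size per paragraph, here is a rough sketch that reduces the dictionary to one (text, size) pair per text block. It assumes that a PyMuPDF text block is a good enough stand-in for a paragraph, which will not hold for every PDF. The function name `block_font_sizes` and the sample data are hypothetical. The function works on a plain dictionary with the same "blocks" / "lines" / "spans" keys used in the snippet above, so it can be tried without a PDF at hand.

```python
from collections import defaultdict


def block_font_sizes(page_dict):
    """Return one (text, font size) pair per text block of a page dictionary.

    `page_dict` is assumed to follow the shape used above:
    page_dict["blocks"] -> block["lines"] -> line["spans"], where each span
    carries "text" and "size". The size reported for a block is the one that
    covers the most characters in that block.
    """
    results = []
    for block in page_dict.get("blocks", []):
        if "lines" not in block:  # e.g. image blocks
            continue
        chars_per_size = defaultdict(int)
        line_texts = []
        for line in block["lines"]:
            spans = line["spans"]
            line_texts.append("".join(span["text"] for span in spans))
            for span in spans:
                # Round so that near-identical sizes are counted together
                chars_per_size[round(span["size"], 1)] += len(span["text"])
        if chars_per_size:
            size = max(chars_per_size, key=chars_per_size.get)
            results.append(("\n".join(line_texts), size))
    return results


# Hand-written sample in the assumed shape; with PyMuPDF installed you would
# pass page.get_text("dict") instead.
sample_page = {
    "blocks": [
        {"lines": [{"spans": [{"text": "Title", "size": 18.0, "font": "Helvetica-Bold"}]}]},
        {"lines": [
            {"spans": [{"text": "Body text with ", "size": 11.0, "font": "Helvetica"},
                       {"text": "note", "size": 8.0, "font": "Helvetica"}]},
            {"spans": [{"text": "second line", "size": 11.0, "font": "Helvetica"}]},
        ]},
        {"type": 1},  # image block without "lines"
    ]
}
for text, size in block_font_sizes(sample_page):
    print(size, repr(text))
```

Weighting by character count keeps a short superscript or footnote marker from deciding the size of a whole block. If you need every size that occurs in a block, return `chars_per_size` itself instead of its maximum.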