Extract the first page of a PDF file using the pdfminer library in Python 3
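I want to get the data from the first page of a PDF file.
I used pdfminer and got all of the PDF's data in the output, but I only want the data from the first page. What should I do?
My code is below.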
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTLine, LAParams
import os

path = r'/home/user/Desktop/abc.pdf'
Extract_Data = []
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
            Extract_Data.append([Font_size, (element.get_text())])
I think my problem is with page_layout.
How can I get only the first page's data?
extract_pages has an optional parameter that does this:
def extract_pages(pdf_file, password='', page_numbers=None, maxpages=0,
                  caching=True, laparams=None):
    """Extract and yield LTPage objects

    :param pdf_file: Either a file path or a file-like object for the PDF file
        to be worked on.
    :param password: For encrypted PDFs, the password to decrypt.
    :param page_numbers: List of zero-indexed page numbers to extract.
    :param maxpages: The maximum number of pages to parse
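So, if I understand correctly, extract_pages(path, page_numbers=[0], maxpages=1) should yield only the first page's data. Note that extract_pages is a generator, so it cannot be indexed with [0]; call next() on it or loop over it to get the page.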
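Putting that together, here is a minimal sketch of how the question's loop might be restricted to the first page. The helper names take_first and first_page_rows are hypothetical, not part of pdfminer. The sketch also assumes that one row per text element is wanted, using the size of the first character in the element as its font size; the original loop may have intended something different. Because extract_pages yields pages lazily, the first page is taken with next() instead of indexing.

```python
def take_first(pages):
    """Return the first item of an iterable, or None if it is empty."""
    return next(iter(pages), None)


def first_page_rows(path):
    """Collect [font_size, text] rows from the first page of the PDF at `path`."""
    # Imported here so take_first can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    # page_numbers is zero-indexed, so [0] selects only the first page.
    page = take_first(extract_pages(path, page_numbers=[0]))
    rows = []
    if page is None:  # nothing was yielded for the requested page
        return rows

    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        # Assumption: the first character's size stands in for the whole element.
        size = next(
            (
                ch.size
                for line in element
                if isinstance(line, LTTextLine)
                for ch in line
                if isinstance(ch, LTChar)
            ),
            None,
        )
        rows.append([size, element.get_text()])
    return rows
```

With the path from the question, this would be called as first_page_rows(r'/home/user/Desktop/abc.pdf'). Passing page_numbers=[0] alone should be enough to limit parsing to the first page; maxpages=1 can be added as well, as suggested above.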