I can't extract the last page's content. Can someone help debug this?

I'm trying to convert a PDF into two lists: titles and contents. But I found that this function doesn't work for the last page of the PDF.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer,LTChar
#pdf--> title list and content list 
def extract_title_content(path):
    title=[]
    content=[]
    a=""
    b=""   
    mode,minn= check_size(path)
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a=""
        b=""           
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:               
                    for character in text_line:
                        if isinstance(character, LTChar):                       
                            if character.size > mode:
                                a+=character.get_text()
                            elif character.size> minn:
                                b+=character.get_text()
                            else:
                                pass  
    return title,content

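In the outer loop, you first append the most recently extracted larger text in a to title and the medium-sized text in b to content, then clear a and b, and only then extract new text into a and b: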

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""           
    # ... extract into a and b ...

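So whatever you extract from the last page never gets added to title and content, because no further iteration runs to append it. (For the same reason, the first entry of each list is an empty string.)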

To fix this, move the appending of a and b to title and content so that it happens after a and b have been filled:

for page_layout in extract_pages(path):
    # ... extract into a and b ...
    title.append(a)
    content.append(b)
    a=""
    b=""           

Alternatively, if for some reason you want to keep appending before filling, explicitly append once more after the loop:

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""           
    # ... extract into a and b ...
title.append(a)
content.append(b)
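
As a rough illustration of the first fix, here is a minimal sketch that runs without a PDF. It assumes each page is just a list of (size, text) pairs standing in for what pdfminer's LTChar.size and LTChar.get_text() would give you; the function name split_by_size and the sample data are made up for this example, and mode and minn play the role of the thresholds returned by check_size in the question.

```python
def split_by_size(pages, mode, minn):
    """Return (title, content) with one string per page in each list.

    `pages` is a hypothetical stand-in for extract_pages(path): an iterable
    of pages, each an iterable of (size, text) pairs.
    """
    title, content = [], []
    for page in pages:
        # Fresh accumulators for every page, so nothing carries over.
        big, medium = [], []
        for size, text in page:
            if size > mode:
                big.append(text)
            elif size > minn:
                medium.append(text)
            # Characters at or below minn are ignored, as in the question.
        # Append only after the page has been processed; this way the
        # last page is kept and no empty leading entry is produced.
        title.append("".join(big))
        content.append("".join(medium))
    return title, content


# Made-up sample data: two "pages" of (size, text) pairs.
pages = [
    [(18, "A"), (12, "x"), (12, "y"), (6, "n")],
    [(18, "B"), (12, "z")],
]
print(split_by_size(pages, mode=12, minn=8))
```

Both pages show up in the result, including the last one. Creating the accumulators inside the loop also means there is nothing to reset at the end of each iteration.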