I can't extract the last page's content. Can anyone help debug this?

Question

I'm trying to convert a PDF into two lists: titles and content. But I found that this function doesn't work for the last page of the PDF.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer,LTChar
#pdf--> title list and content list 
def extract_title_content(path):
    title=[]
    content=[]
    a=""
    b=""   
    mode,minn= check_size(path)
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a=""
        b=""           
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:               
                    for character in text_line:
                        if isinstance(character, LTChar):                       
                            if character.size > mode:
                                a+=character.get_text()
                            elif character.size> minn:
                                b+=character.get_text()
                            else:
                                pass  
    return title,content

Answer 1

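In the outer loop, you first append the most recently extracted larger text in a to title and the medium-sized text in b to content, then clear a and b, and only then extract new text into a and b: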

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""           
    [... extract into a and b ...]

As a result, what you extract from the last page is never added to title and content.
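
The effect is easy to reproduce without a PDF. The following is a minimal sketch in which plain strings stand in for the text extracted from each page; the names pages, collected and current are hypothetical and only illustrate the ordering of the append relative to the extraction:

```python
# Hypothetical stand-in: each string plays the role of one page's extracted text.
pages = ["page 1", "page 2", "page 3"]

collected = []
current = ""
for page in pages:
    collected.append(current)  # stores the PREVIOUS page's text (empty on the first pass)
    current = ""
    current += page            # this page's text is only stored on the next iteration

print(collected)  # ['', 'page 1', 'page 2']
print(current)    # 'page 3' is left over and never appended
```

The list starts with an empty entry and the last page's text is still sitting in current when the loop ends.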

To fix this, move the appends of a and b to title and content to after a and b have been filled:

for page_layout in extract_pages(path):
    [... extract into a and b ...]
    title.append(a)
    content.append(b)
    a=""
    b=""

Alternatively, if for some reason you want to keep appending before filling, explicitly append once more after the loop:

for page_layout in extract_pages(path):
    title.append(a)
    content.append(b)
    a=""
    b=""
    [... extract into a and b ...]
title.append(a)
content.append(b)
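
For reference, here is one way the first fix could look as a complete function. This is only a sketch under some assumptions: check_size is not shown in the question, so mode and minn are taken as parameters here instead, and the helper names split_page_chars and iter_page_chars are made up for illustration. Collecting each page's text in fresh local variables means there is no state carried between pages, so the append cannot end up one page behind.

```python
def split_page_chars(chars, mode, minn):
    """Split one page's (size, text) pairs into (title_text, content_text).

    Characters larger than `mode` count as title text, characters larger
    than `minn` (but not larger than `mode`) count as content, and
    anything else is dropped, mirroring the thresholds in the question.
    """
    title_parts = []
    content_parts = []
    for size, text in chars:
        if size > mode:
            title_parts.append(text)
        elif size > minn:
            content_parts.append(text)
    return "".join(title_parts), "".join(content_parts)


def iter_page_chars(page_layout):
    """Yield (size, text) for every LTChar on a pdfminer page layout."""
    # Imported lazily so split_page_chars can be used without pdfminer installed.
    from pdfminer.layout import LTChar, LTTextContainer

    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        yield character.size, character.get_text()


def extract_title_content(path, mode, minn):
    """Return (titles, contents) with one entry per page, last page included."""
    from pdfminer.high_level import extract_pages

    titles = []
    contents = []
    for page_layout in extract_pages(path):
        # Extract first, then append, so every page (including the last) is stored.
        page_title, page_content = split_page_chars(
            iter_page_chars(page_layout), mode, minn
        )
        titles.append(page_title)
        contents.append(page_content)
    return titles, contents
```

Assuming check_size(path) returns the two thresholds as in the question, a call would look like extract_title_content(path, *check_size(path)).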

python

pdf

debugging