How can I extract the TOC with PyPDF2?
Take this PDF as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:
<outlines>
<outline level="1" title="1 Introduction">
<dest>
<list size="5">
<ref id="513"/>
<literal>XYZ</literal>
<number>99.213</number>
<number>742.911</number>
<null/>
</list>
</dest>
<pageno>14</pageno>
</outline>
<outline level="1" title="2 Convolutional Neural Networks">
<dest>
<list size="5">
<ref id="554"/>
<literal>XYZ</literal>
<number>99.213</number>
<number>742.911</number>
<null/>
</list>
</dest>
<pageno>16</pageno>
</outline>
...
Can I do something similar with PyPDF2?
Found it:
from PyPDF2 import PdfFileReader

# PyPDF2 < 3.0 API; later releases renamed these to PdfReader and reader.outline.
with open("1707.09725.pdf", "rb") as f:
    reader = PdfFileReader(f)
    print(reader.outlines)
which gives:
[{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
{'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)}, [{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
{'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
[{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
{'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
{'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
{'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
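The nested list above is awkward to read: judging from the output, a sub-list holds the children of the entry just before it. A minimal sketch of flattening it into (level, title) pairs follows. The name `flatten_outline` and the `sample` data are hypothetical, and plain dicts stand in for PyPDF2's destination objects, which can also be indexed by "/Title".

```python
def flatten_outline(outline, level=1):
    """Yield (level, title) pairs from a nested PyPDF2-style outline list."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the entry just before it.
            yield from flatten_outline(item, level + 1)
        else:
            yield level, item["/Title"]


# Stand-in data shaped like the output above (plain dicts, not real
# PyPDF2 objects), trimmed to a few entries.
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

for level, title in flatten_outline(sample):
    # Indent each title by its nesting level.
    print("  " * (level - 1) + title)
```

With a real reader you would pass `reader.outlines` instead of `sample`. To turn the `/Page` references into page numbers, PyPDF2 releases of that era should offer `reader.getDestinationPageNumber(item)`, which I believe returns a zero-based index; verify that against your installed version.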
Alternatively, as suggested in this answer, you can use pikepdf:
from pikepdf import Pdf

path = "path/to/file.pdf"

with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)
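The loop above only reaches two levels, while the example PDF nests at least three deep (2.2.1 and so on). A recursive walk handles any depth. This is a sketch, not part of the original answer: `walk_outline` and `print_outline` are hypothetical names, and it relies only on each outline item exposing `title` and `children`, as pikepdf's outline items do. The demo uses `SimpleNamespace` stand-ins so it runs without a PDF.

```python
from types import SimpleNamespace


def walk_outline(items, depth=0):
    """Yield (depth, title) for every outline item, at any nesting depth."""
    for item in items:
        yield depth, item.title
        yield from walk_outline(item.children, depth + 1)


def print_outline(path):
    """Print the outline of the PDF at `path`, one tab per nesting level."""
    # Imported lazily so walk_outline can be tried without pikepdf installed.
    from pikepdf import Pdf

    with Pdf.open(path) as pdf:
        with pdf.open_outline() as outline:
            for depth, title in walk_outline(outline.root):
                print("\t" * depth + title)


# Stand-in tree (not real pikepdf objects) to show the traversal order.
demo = [
    SimpleNamespace(title="1 Introduction", children=[]),
    SimpleNamespace(
        title="2 Convolutional Neural Networks",
        children=[
            SimpleNamespace(
                title="2.2 CNN Layer Types",
                children=[SimpleNamespace(title="2.2.3 Dropout", children=[])],
            )
        ],
    ),
]

for depth, title in walk_outline(demo):
    print("\t" * depth + title)
```

For a real file, call `print_outline("path/to/file.pdf")`.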