Extract a fixed-size, fixed-position table from PDF files in Python

Question

Suppose I have many PDF files similar to the one here:

I want to extract the following table and save it as an Excel file:

I can extract the table manually and save it as an Excel file using the excalibur package.

After installing Excalibur with pip3, I initialize the metadata database with:

$ excalibur initdb

Then I start the web server with:

$ excalibur webserver

Then I go to http://localhost:5000 and start extracting tabular data from the PDF.

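I would like to know whether this can be automated for multiple PDF files with a Python script, using packages such as excalibur-py, camelot, or pdfminer, since the table's size and position are fixed across reports for the same city.

You can download other report files from this link.

Many thanks.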

Answer 1

With Camelot, you can build a pipeline like this:

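import os
import camelot

# Placeholders: replace with the real PDF paths.
files_list = ['FIRST_PATH', 'SECOND_PATH', ...]
# Placeholders: each region is a string 'x1,y1,x2,y2' in PDF coordinates.
regions = ['REGION_COORDINATES_1', 'REGION_COORDINATES_2', ...]

for filepath in files_list:
    tables = camelot.read_pdf(filepath, pages='1-end', table_regions=regions)
    # Name the output after the input so each PDF gets its own Excel file.
    output = os.path.splitext(filepath)[0] + '.xlsx'
    tables.export(output, f='excel')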

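Use the table_regions parameter when you only know the approximate location of the table on the page; if you know the table's exact position, use table_areas instead.

You can read more about these parameters and other topics in the Camelot documentation.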
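Since the question says the table sits at the same place in every report for a given city, the table_areas route might look roughly like the sketch below. The helper names (area_string, output_path, extract_fixed_table, extract_all), the coordinates in the usage note, and the choice of the lattice flavor are all assumptions for illustration, not something taken from the reports themselves. Camelot expects each area as a string 'x1,y1,x2,y2', with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinate space, where y grows upward from the bottom of the page.

```python
from pathlib import Path


def area_string(x1, y1, x2, y2):
    """Format a bounding box as the 'x1,y1,x2,y2' string Camelot expects.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    In PDF coordinates y grows upward, so the top edge has the larger y.
    """
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


def output_path(pdf_path, out_dir):
    """Hypothetical naming scheme: one .xlsx per input PDF, same base name."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xlsx")


def extract_fixed_table(pdf_path, area, out_dir, pages="1", flavor="lattice"):
    """Read the table found in `area` and save it as an Excel file.

    Assumption: 'lattice' suits tables drawn with ruling lines; try
    flavor='stream' if the table is separated by whitespace only.
    """
    import camelot  # imported here so the helpers above work without it

    tables = camelot.read_pdf(
        str(pdf_path), pages=pages, flavor=flavor, table_areas=[area]
    )
    if len(tables) == 0:
        return None  # nothing detected inside the given area
    target = output_path(pdf_path, out_dir)
    # Writing .xlsx through pandas requires the openpyxl package.
    tables[0].df.to_excel(target, index=False, header=False)
    return target


def extract_all(pdf_dir, area, out_dir):
    """Apply the same fixed area to every PDF in a directory."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    return [extract_fixed_table(pdf, area, out_dir) for pdf in pdfs]
```

A call such as extract_all('reports/city_a', area_string(50, 700, 550, 400), 'out') would then process one city's folder; the numbers here are made up and would need to be measured once on a sample report, for example by noting the area selected in Excalibur. Keeping one area string per city keeps the fixed-layout assumption explicit and easy to adjust.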

Tags: text-extraction, python-3.x, pdfminer, python-camelot, excalibur-py