tabula vs camelot for table extraction from PDF

Question

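I need to extract tables from PDFs. The tables can be of any kind: multiple headers, vertical headers, horizontal headers, and so on.

I have implemented the basic use case with both libraries and found that tabula does slightly better than camelot, but it still does not detect every table perfectly, and I am not sure it will work for all table types.

So I am looking for advice from people who have implemented a similar use case.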

Tabula implementation:

import tabula

# read_pdf returns a list of pandas DataFrames, one per detected table
tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")

Camelot implementation:

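import camelot

# split_text=True splits text that spans several cells across those cells
tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
# each item is a camelot Table; .df holds its contents as a pandas DataFrame
for tabs in tables:
    print(tabs.df, "\n=================================\n")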

Answer 1

Please read: https://camelot-py.readthedocs.io/en/master/#why-camelot

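The main advantage of Camelot is that the library offers a rich set of parameters that let you improve the extraction.

Of course, applying these parameters takes some study and a fair amount of trial and error.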
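To give an idea of what that trial and error can look like, here is a minimal sketch that tries a few parameter sets and keeps the one with the highest mean accuracy from Camelot's parsing report. The helper names (candidate_settings, mean_accuracy, best_extraction) and the specific values for line_scale and row_tol are my own assumptions, not recommendations from the Camelot docs. Also keep in mind that accuracy scores from the lattice and stream flavors are not strictly comparable, so treat the result as a starting point and inspect the tables yourself.

```python
def candidate_settings():
    """Hypothetical starting points; tune the values for your own PDFs."""
    return [
        {"flavor": "lattice"},                    # tables with ruling lines
        {"flavor": "lattice", "line_scale": 40},  # pick up shorter lines
        {"flavor": "stream"},                     # tables separated by whitespace
        {"flavor": "stream", "row_tol": 10},      # merge text lines that are close together
    ]


def mean_accuracy(reports):
    """Average the 'accuracy' field over a list of parsing-report dicts."""
    if not reports:
        return 0.0
    return sum(r["accuracy"] for r in reports) / len(reports)


def best_extraction(pdf_path, pages="all"):
    """Run every candidate setting and return (score, kwargs, tables) for the best one."""
    import camelot  # imported here so the helpers above work without camelot installed

    best = None
    for kwargs in candidate_settings():
        tables = camelot.read_pdf(pdf_path, pages=pages, **kwargs)
        score = mean_accuracy([t.parsing_report for t in tables])
        if best is None or score > best[0]:
            best = (score, kwargs, tables)
    return best
```

Usage would be something like score, kwargs, tables = best_extraction('pdfs/PDF1.pdf'), followed by looking at each tables[i].df to confirm the result actually matches the PDF.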

Here you can find a comparison of Camelot with other PDF table extraction libraries.

Answer 2

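I think Camelot does a better job of extracting data in a clean format rather than a jumbled one (that is, the data keeps its information and the row contents are not affected). As a result, the extracted data is of better quality when the number of lines per cell varies. Also note that Tabula requires a Java runtime environment.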
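Since the Java requirement is easy to trip over, here is a small sketch that checks for a java executable on the PATH before calling tabula. The function names and the error message are hypothetical; the lattice and stream arguments are the tabula-py options for ruled and whitespace-separated tables respectively.

```python
import shutil


def java_available(which=shutil.which):
    """Return True if a 'java' executable can be found on the PATH."""
    return which("java") is not None


def read_tables_with_tabula(pdf_path, pages="all", lattice=False):
    """Read tables with tabula-py, failing early with a clear message if Java is missing."""
    if not java_available():
        raise RuntimeError("tabula-py needs a Java runtime; no 'java' found on PATH")

    import tabula  # imported here so the Java check above works without tabula-py installed

    # lattice=True suits tables with ruling lines; otherwise fall back to stream mode
    return tabula.read_pdf(pdf_path, pages=pages, lattice=lattice, stream=not lattice)
```

Camelot has no such dependency on Java, which can matter if you deploy to an environment where you cannot install one.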

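There are both open-source (Tabula, pdf-table-extract) and closed-source (smallpdf, PDFTables) tools that are widely used to extract tables from PDF files. They either give good output or fail miserably, with nothing in between. That is not helpful, because everything in the real world, PDF table extraction included, is fuzzy. This leads people to write ad-hoc table extraction scripts for each type of PDF table. Camelot was created to give users complete control over table extraction. If you can't get the output you want with the default settings, you can tweak them and get the job done!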

python

pdf

tabula

python-camelot