Tabula-py is not splitting columns right

Question

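I have just discovered the joys of tabula-py (and of course tabula-java) for extracting tables from PDFs. I am now writing a script for work that reads some data from a PDF table, cleans it up, and exports it to Excel. The PDF I use has the same format every day, and the table is always in the same area. To find that area, I use tabula.exe: I select the table, check the visual preview (which looks fine), and then export the script to see which -a parameter tabula.exe uses. I then use that value in my Python command, namely: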

    df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                         pandas_options={'header': None})

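I pass the encoding parameter because the default utf-8 returns an error, and I use the stream method because it gives a good table extraction in tabula.exe. However, the resulting dataframe has a problem: the first 2 columns (which show up correctly as 2 separate columns in the tabula.exe preview) come out as a single column, so the names and values are mixed together.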

Do you know why the same area would produce 2 different results in tabula-py and tabula.exe? Thanks a lot!

Answer 1

Figured it out on GitHub: tabula-py sets the "guess" option to True by default. So to fix the discrepancy you can just add guess=False, and the output is the same as in tabula.exe!

    df = tabula.read_pdf(os.fsdecode(directory) + filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491", pages=2,
                         guess=False, pandas_options={'header': None})
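For anyone who wants to keep the options in one place, here is a minimal sketch of the same call with the settings collected in a dict. The helper names `parse_area` and `build_read_options` are made up for illustration and are not part of tabula-py. It assumes the area string is in the "top,left,bottom,right" order that tabula.exe exports with -a, and it passes the area as a list of floats, which is the form the tabula-py documentation describes. The actual `read_pdf` call is left commented out because it needs Java and a real PDF.

```python
def parse_area(area_str):
    """Turn a 'top,left,bottom,right' string (as exported with -a) into floats."""
    values = [float(v) for v in area_str.split(",")]
    if len(values) != 4:
        raise ValueError("expected 4 comma-separated numbers, got %d" % len(values))
    return values


def build_read_options(area_str, page):
    """Collect the keyword arguments used in the answer above (hypothetical helper)."""
    return {
        "pages": page,
        "area": parse_area(area_str),
        "stream": True,
        "guess": False,  # do not let tabula guess the portion of the page to analyze
        "encoding": "ISO-8859-1",
        "pandas_options": {"header": None},
    }


options = build_read_options("81.106,302.475,384.697,552.491", 2)
print(options)

# With tabula-py and Java installed, the call would look like this
# (pdf_path is a placeholder for your own file):
# import tabula
# result = tabula.read_pdf(pdf_path, **options)
```

Depending on the tabula-py version, `read_pdf` may return a list of dataframes rather than a single dataframe, so check the return type before cleaning the data.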

Answer 2

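If anyone else is confused about where to draw the boundaries of tables and columns, you can easily find the exact dimensions with Adobe Acrobat. Open the PDF in Adobe Acrobat, turn on the rulers, and set the units to points. Zoom all the way in and you can read off exact point measurements for splitting areas/tables.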
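As a rough sketch of how those ruler readings could be turned into arguments, the snippet below builds an `area` list and a list of column boundaries. The helper names are hypothetical, and it assumes the rulers measure from the top-left corner of the page, so the readings map directly onto the top, left, bottom, right order. The `columns` option of `read_pdf` takes x coordinates of column boundaries; the column values used here are invented for illustration and are not taken from the original PDF.

```python
POINTS_PER_INCH = 72.0  # PDF points per inch, in case the ruler is set to inches


def inches_to_points(inches):
    """Convert a ruler reading in inches to PDF points."""
    return inches * POINTS_PER_INCH


def make_area(top, left, bottom, right):
    """Return [top, left, bottom, right] in points after a basic sanity check."""
    if bottom <= top or right <= left:
        raise ValueError("area must satisfy top < bottom and left < right")
    return [top, left, bottom, right]


def columns_inside(area, columns):
    """Keep only the column boundaries that fall strictly inside the area, sorted."""
    _, left, _, right = area
    return sorted(x for x in columns if left < x < right)


area = make_area(81.106, 302.475, 384.697, 552.491)
# Hypothetical x coordinates; 600.0 lies outside the area and is dropped.
columns = columns_inside(area, [400.0, 600.0])
print(area, columns)

# With tabula-py installed, these could be passed along as:
# tabula.read_pdf(pdf_path, pages=2, stream=True, guess=False,
#                 area=area, columns=columns)
```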
