Python Library Camelot not reading all tables in one page

Question

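I am using the Camelot Python library to read all the tables on one page of a PDF document.

I want to read all the tables on page 10 of this PDF.

I tried debugging by plotting the page, and I noticed something when I change the flavor:

This is with flavor lattice:

This is with flavor stream:

The problem is that with flavor='lattice' the tables are not read correctly; an example is here.

With flavor='stream' the data is read correctly, but only as a single table; the output looks like this.

I tried using table_area/table_regions to detect the two tables with flavor='stream', but it did not work. My code is below.

Code with lattice: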

import camelot

file = "2022/Auto-trend0122.pdf" 
tables = camelot.read_pdf(file,pages='10',flavor='lattice',edge_tool=1500) 
print("Total tables extracted:", tables.n) 
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
print(tables[1].df)

Code with stream, without table_area/table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream', edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream and table_area:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_area=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

Code with stream and table_regions:

import camelot

file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_regions=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')

The output is the same with table_regions, with table_area, and without either.

Answer 1

The problem is that you are using table_area instead of the correct parameter, table_areas (see the docs).

The following command works perfectly:

tables = camelot.read_pdf(file,pages='10', flavor='stream', edge_tool=1500, table_areas=['10,450,550,50','10,750,550,450'])
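
Since the misspelled keyword in the question produced no error (the output was simply unchanged), a small guard can catch this kind of typo before Camelot is called. The following is only a sketch under stated assumptions: `STREAM_KWARGS` is a hand-copied list of the keyword names documented for `flavor='stream'` and may not match your installed version, and `check_stream_kwargs` is a hypothetical helper, not part of Camelot. Note that the same check also flags `edge_tool`, since the documented stream option is spelled `edge_tol`.

```python
import difflib

# Keyword names documented for camelot.read_pdf with flavor='stream'.
# Assumption: copied by hand from the docs; verify against your installed version.
STREAM_KWARGS = sorted({
    "pages", "password", "flavor", "suppress_stdout", "layout_kwargs",
    "table_regions", "table_areas", "columns", "split_text", "flag_size",
    "strip_text", "edge_tol", "row_tol", "column_tol",
})


def check_stream_kwargs(**kwargs):
    """Return {unknown_name: closest_known_name_or_None} for the given kwargs."""
    problems = {}
    for name in kwargs:
        if name not in STREAM_KWARGS:
            # Suggest the most similar documented name, if any is close enough.
            matches = difflib.get_close_matches(name, STREAM_KWARGS, n=1)
            problems[name] = matches[0] if matches else None
    return problems


if __name__ == "__main__":
    # The keyword arguments as written in the question.
    options = dict(pages="10", flavor="stream", edge_tool=1500,
                   table_area=["10,450,550,50", "10,750,550,450"])
    for bad, suggestion in check_stream_kwargs(**options).items():
        print(f"unknown keyword {bad!r}; did you mean {suggestion!r}?")
```

An empty result means every keyword is in the list; anything else is worth checking against the docs before trusting the extraction.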

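Difference between table_areas and table_regions

table_areas should be used when you know the exact position of the table. In contrast, table_regions makes the detection engine look for tables only in those general regions of the page.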
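
To make the distinction concrete, here is a minimal sketch. Both parameters take strings of the form `x1,y1,x2,y2`, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page), so y1 is larger than y2. `area_string` and `read_boxes` are hypothetical helper names introduced for illustration, and Camelot is imported lazily so the formatting helper can be used on its own.

```python
def area_string(x1, y1, x2, y2):
    """Format a box as 'x1,y1,x2,y2' (top-left, bottom-right; PDF coordinates)."""
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


def read_boxes(path, page, boxes, exact=True):
    """Read tables from `boxes` on one page using the stream flavor.

    exact=True  -> table_areas: each box is treated as the table's own extent.
    exact=False -> table_regions: table detection runs only inside each box.
    """
    import camelot  # imported here so area_string works without Camelot installed

    key = "table_areas" if exact else "table_regions"
    areas = [area_string(*box) for box in boxes]
    return camelot.read_pdf(path, pages=str(page), flavor="stream", **{key: areas})
```

With the coordinates from the question, the call would be `read_boxes("2022/Auto-trend0122.pdf", 10, [(10, 450, 550, 50), (10, 750, 550, 450)])`; passing `exact=False` switches to table_regions for cases where only the rough location of each table is known.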
