Data missing while reading a PDF file using tabula and Python

Question

I have a PDF that contains a mix of text and tables. One of its rows looks like this:

PDF content :
Id: 5647484848 Name Alex J

I am using tabula-py to parse the content, but part of the result is missing: as you can see below, the first character or digit of each value is dropped.

My original PDF has a lot of text and tables. I also tried other rows, and for those I get the correct result.

Wrong Result :
['', '', 'Id:', '', '647484848', 'Name', '', 'lex J', '', '', '']

Should be :
['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']

Sample:

# Pick out the row that holds the name; index 7 is the Name value
if len(row) == 11:
    if "Name" in row:
        print(row[7])
        return Student(studentname=row[7])
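
The fixed index `row[7]` only works while the extractor keeps every cell in the same position. As a minimal sketch, assuming the value is always the first non-empty cell after its label, the lookup could be done by label instead. The helper name `value_after_label` is hypothetical and not part of tabula-py:

```python
from typing import List, Optional


def value_after_label(row: List[str], label: str) -> Optional[str]:
    """Return the first non-empty cell that follows `label` in `row`, or None."""
    if label not in row:
        return None
    start = row.index(label) + 1
    for cell in row[start:]:
        # Skip the empty padding cells that the extractor emits between values.
        if cell.strip():
            return cell.strip()
    return None


# The expected row from the question, used here as sample input.
row = ['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']
name = value_after_label(row, "Name")
student_id = value_after_label(row, "Id:")
print(name, student_id)
```

This does not recover characters that the extraction itself dropped; it only avoids depending on the exact cell count.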

When reading the tables, I set:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                          password=secure_password, lattice=True)

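The row is plain text, with no images. I don't know why extraction fails for this particular row; I applied similar logic to other rows and got correct results. Please advise.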

Answer 1

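Solved by changing the extraction mode in tabula-py from lattice=True to lattice=False. Lattice mode relies on ruling lines drawn between cells, so it is likely to misjudge cell boundaries on a plain-text row that has none and clip the leading character.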

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                          password=secure_password, lattice=False)
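
To confirm that the extraction mode is the cause, it may help to run both modes on the same file and compare the rows. The following is a rough sketch, assuming tabula-py's JSON output is a list of tables whose "data" key holds rows of cells with a "text" field. `rows_from_json` and `extract_rows` are hypothetical helper names, and `stream=True` is passed only to make the non-lattice mode explicit:

```python
from typing import Any, Dict, List


def rows_from_json(tables: List[Dict[str, Any]]) -> List[List[str]]:
    """Flatten tabula-py JSON output into plain lists of cell text."""
    return [[cell.get("text", "") for cell in row]
            for table in tables
            for row in table.get("data", [])]


def extract_rows(pdf: str, password: str, lattice: bool) -> List[List[str]]:
    """Read every page of `pdf` in lattice or stream mode and return its rows."""
    # Imported lazily so rows_from_json can be used without tabula/Java installed.
    import tabula

    tables = tabula.read_pdf(pdf, output_format="json", pages="all",
                             password=password,
                             lattice=lattice, stream=not lattice)
    return rows_from_json(tables)
```

Calling `extract_rows(pdf, secure_password, lattice=True)` and `extract_rows(pdf, secure_password, lattice=False)` and printing the rows that contain "Name" should show which mode drops the leading characters.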
