How can I extract PDF tables with something other than tabula?

Question

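I have a working script that reads PDF tables with the tabula package. However, tabula depends on Java 8, and because of some internal tools we have to stay on Java 6 or below. How can we read tables from a PDF without tabula?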

from tabula import read_pdf
df_list = read_pdf(current_file, pages="all", lattice=True)

Answer 1

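How to convert a PDF document to an Excel spreadsheet:

Option 1: use the pdf_tables API:

Install pdf_tables: pip install git+https://github.com/pdftables/python-pdftables-api.git
Sign up for a PDFTables account to get an API key.

Once everything is installed, you can run this code: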

import pdftables_api

c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output')
# Replace c.xlsx with c.csv to convert to CSV
# Replace c.xlsx with c.xml to convert to XML
# Replace c.xlsx with c.html to convert to HTML
# This is code from the documentation, for your information

Don't forget to replace my-api-key with your API key, input.pdf with the path to your PDF, and output with the path where you want the resulting Excel document to be saved.
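If you have several PDFs, the same call can be wrapped in a small loop. The following is only a sketch: convert_all and its parameters are hypothetical names, not part of pdftables_api, and it assumes the converter is a callable that takes an input path and an output path, as c.xlsx does in the code above.

```python
from pathlib import Path


def convert_all(pdf_paths, convert, out_dir, suffix=".xlsx"):
    """Run convert(input_path, output_path) on each PDF; return the output paths.

    `convert` is any callable with that two-argument shape, for example the
    c.xlsx or c.csv method of a pdftables_api.Client (an assumption based on
    the call shown in this answer).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf in map(Path, pdf_paths):
        # Name each output after its source file, e.g. report.pdf -> report.xlsx
        target = out_dir / (pdf.stem + suffix)
        convert(str(pdf), str(target))
        outputs.append(target)
    return outputs


# Usage (hypothetical file names; needs a real API key and network access):
#   import pdftables_api
#   c = pdftables_api.Client('my-api-key')
#   convert_all(['a.pdf', 'b.pdf'], c.xlsx, 'converted')
```

Passing the converter in as an argument keeps the loop independent of the output format, so switching to CSV only means passing c.csv and suffix=".csv".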

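Option 2: use textract to read the PDF, then use xlwt to write the spreadsheet:

Install textract with pip install textract
Install xlwt with pip install xlwt

Once the dependencies are installed, you can run the following code: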

import textract
from xlwt import Workbook

wb = Workbook()
sheet1 = wb.add_sheet('Sheet 1')  # the sheet that cells will be written to

# Change this to the path of your file; process() returns the extracted text
text = textract.process("path/to/file.extension")

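I don't know how your PDF is organized, so you will have to work out how to write the extracted text to the Excel document from there. (On a sheet created with wb.add_sheet, you can use sheet1.write(1, 0, 'Data'), where 1 and 0 are the row and column coordinates of the cell in the spreadsheet.)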
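As a rough starting point, the extracted text can be split into rows and cells and then written cell by cell. This is a sketch under a strong assumption: each table row is one line of text, and columns are separated by a tab or by two or more spaces. Real PDFs often need a different rule. text_to_rows and write_rows are hypothetical helper names, and write_rows only relies on the sheet having a write(row, column, value) method like the one mentioned above.

```python
import re


def text_to_rows(raw):
    """Split extracted text into rows of cells.

    Assumption: one table row per line, with columns separated by a tab
    or by two or more whitespace characters.
    """
    # Accept either bytes or str so the helper works with both kinds of input
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def write_rows(sheet, rows):
    """Write every cell with sheet.write(row_index, col_index, value)."""
    for r, cells in enumerate(rows):
        for c, value in enumerate(cells):
            sheet.write(r, c, value)


# Usage with the objects from the code above (output file name is hypothetical):
#   sheet1 = wb.add_sheet('Sheet 1')
#   write_rows(sheet1, text_to_rows(text))
#   wb.save('output.xls')
```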

Personally, I think you should use the pdf_tables API rather than doing the conversion by hand.

python

tabula