In Python, what is the best way to read a pdf table with no outline?

Question

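I am trying to read data from a table in a pdf into a pandas dataframe. I am able to do so using tabula-py when the pdf has outlines around the table, but when I try it on a pdf without an outline the script produces an error.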

For example, I am looking at the pdfs available from two different urls. I have downloaded the pdfs from the urls and saved them as 'JSE Divs.pdf' and 'JSE Opts.pdf' respectively.

import requests
import pandas as pd

# Download the dividends report (the table without an outline).
url = 'https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Dividends/ED_DividendsReport.pdf'
response = requests.get(url)
fname = 'JSE Divs.pdf'
with open(fname, 'wb') as f:
    f.write(response.content)

# Download the options report (the table with an outline).
url = 'https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Options%20Daily%20Traded%20Report/ED_OptionsDailyTradedReport.pdf'
response = requests.get(url)
fname = 'JSE Opts.pdf'
with open(fname, 'wb') as f:
    f.write(response.content)

I am able to read 'JSE Opts.pdf' into a pandas dataframe using the code:

import tabula as tb

pdf = './JSE Opts.pdf'
data = tb.read_pdf(pdf, pages=1)
data = data[0]  # read_pdf returns a list of dataframes; take the first table
print(data)

When I try to do the same for 'JSE Divs.pdf', I get errors and tabula-py is only able to read the header:

pdf = './JSE Divs.pdf'
data = tb.read_pdf(pdf, pages=1)
data = data[0]
print(data)

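I suspect that this is because there are no lines around the table. If that is the case, what is the best way to go about reading the data from 'JSE Divs.pdf' into pandas?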

Answer 1

I was able to read the data into a string using pdfplumber, save the string as a CSV file (after cleaning the data to suit my needs) and then import it into pandas.
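The clean-up and import steps can be sketched on a small string before touching the real file. This is only an illustration: `SAMPLE_TEXT`, its three columns and the helper `text_to_frame` are made up here, and the actual report's layout and values will differ. It also reads the cleaned text straight from memory with `io.StringIO` instead of writing a CSV file first, and uses a regular expression with a word boundary where the full script uses plain `str.replace`.

```python
import io
import re

import pandas as pd

# Made-up sample standing in for the text pdfplumber would extract;
# the real report has more columns and different values.
SAMPLE_TEXT = (
    "Expiry Underlying Div\n"
    "20MAR25 ABC DN 1.25\n"
    "20MAR25 XYZ CSH 0.40\n"
)

# Marker tokens to drop (a subset of the ones removed in the full script).
UNWANTED = re.compile(r" (?:DN|CSH|PHY)\b")


def text_to_frame(text: str, columns: list[str]) -> pd.DataFrame:
    """Turn space-separated report text into a DataFrame without a temp file."""
    body = text[text.index("Div\n") + 4:]  # skip everything up to the header's end
    body = UNWANTED.sub("", body)          # remove the marker tokens
    return pd.read_csv(io.StringIO(body), sep=" ", names=columns)


df = text_to_frame(SAMPLE_TEXT, ["EXPIRY", "USYM", "DIV"])
print(df)
```

The full script for the actual PDF follows.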

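import pdfplumber

# Concatenate the text of every page. Depending on the pdfplumber version,
# extract_text() may return None for a page with no text, hence the "or ''".
text = ''
with pdfplumber.open("./JSE Divs.pdf") as pdf:
    for page in pdf.pages:
        text += (page.extract_text() or '') + '\n'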

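# Strip marker tokens that are not wanted in the output.
for replace_s in [' DN',' CA1',' ANY',' CSH',' PHY',' QUANTO']:
    text = text.replace(replace_s,'')

# Remove every 'EXO' entry: the character before it, 'EXO' itself
# and the 5 characters after it.
while True:
    try:
        idx = text.index('EXO')
        replace_s = text[idx-1:idx+8]
        text = text.replace(replace_s,'')
    except ValueError:
        break

# Discard everything up to and including the first line ending in 'Div',
# prepend the column names, and turn the space-separated text into CSV.
cols = 'EXPIRY_s,USYM,EXPIRY,EX_DATE,CUM_PV_DIVS,CUM_DIVS,ISIN,INSTR_ID\n'
text = text[text.index('Div\n')+4:]
text = cols + text
text = text.replace(' ',',')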

with open('divs.csv','w') as f:
    f.write(text)

# Import the cleaned CSV into pandas.
import pandas as pd
df = pd.read_csv('divs.csv')
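A possible alternative, not tested against this PDF: tabula-py's `read_pdf` accepts `stream=True`, which infers columns from the whitespace between them, and `lattice=True`, which relies on ruling lines. The wrapper below is a hypothetical helper (the name `read_pdf_tables` and its `mode` argument are not part of tabula-py) that only makes the choice explicit. Whether stream mode recovers the dividends table cleanly is an assumption that needs checking on the real file.

```python
def read_pdf_tables(path, mode="stream", pages="all"):
    """Read tables with tabula-py in either stream or lattice mode.

    stream  -> columns inferred from whitespace (tables without an outline)
    lattice -> cells inferred from ruling lines (tables with an outline)
    """
    if mode not in ("stream", "lattice"):
        raise ValueError(f"mode must be 'stream' or 'lattice', got {mode!r}")

    # Imported lazily so this module loads even where tabula-py is missing;
    # tabula-py also needs a Java runtime when read_pdf is called.
    import tabula

    # read_pdf returns a list of DataFrames, one per detected table.
    return tabula.read_pdf(
        path,
        pages=pages,
        stream=(mode == "stream"),
        lattice=(mode == "lattice"),
    )
```

It would be called as `read_pdf_tables('./JSE Divs.pdf')` for the table without an outline and `read_pdf_tables('./JSE Opts.pdf', mode='lattice')` for the one with an outline. If the columns come out merged or split, the pdfplumber route above gives more control over the clean-up.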
