Pdfplumber misses first column and last row for all tables within a schematic

Question

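I'm new to pdfplumber, and I'm impressed by how well it extracts text from tables.

It is easy to use on pages that consist entirely of tables, but in my case I am working with topology schematics that contain a few tables.

The first column and the last row of every table in the document are not extracted. I tried tuning several parameters in the table_settings variable, but unfortunately I could not get better results (in my case, the rest of the text in the schematic is treated as a table if I use "text" instead of "lines" as the strategy).

Any help? I am using Python 3.9.8, and the PDF used for testing can be found here: schematic.pdf

The source code is below: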

import pdfplumber

pdf_file = "Schematic.pdf"
with pdfplumber.open(pdf_file) as pdf:
    # Extract every table found on the first page using the default settings
    tbl = pdf.pages[0].extract_tables()
    print(tbl)

Answer 1

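Some of the edges in the PDF look like lines but are not exactly what pdfplumber treats as lines. For this case, all curves and edges can be explicitly treated as lines. The following table settings work for this case: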

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": page.curves+page.edges,
    "explicit_horizontal_lines": page.curves+page.edges,
    "intersection_tolerance": 15,
}

['(cid:47)(cid:44)(cid:54)(cid:55)(cid:36)(cid:3)(cid:39)(cid:40)(cid:3)(cid:39)(cid:40)(cid:54)(cid:57)(cid:203)(cid:50)(cid:54)', None, None, None, None, None]
['(cid:49)(cid:158)', 'PK', 'VEL.', '(cid:49)(cid:158)', 'PK', 'VEL.']
['A64', '3+100', '100 Km/h', 'A66', '3+365', '100 Km/h']
['A65', '3+189', '100 Km/h', 'S2MSU2', '5+884', '100 Km/h']
['A67', '3+363', '100 Km/h', 'S4MSU1', '6+052', '100 Km/h']
['', '', '', '', '', '']

['(cid:54)(cid:40)(cid:102)(cid:36)(cid:47)(cid:40)(cid:54)', None, None, None]
['NOMBRE', 'PK', 'NOMBRE', 'PK']
['E3', '3+720', 'EMSUF2', '5+766']
['E4', '3+784', 'EMSUF1', '5+766']
['B004F2', '4+295', 'SMSUM2', '6+185']
['B004F1', '4+295', 'SMSUM1', '6+188']
['', '', '', '']
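
For completeness, here is a minimal sketch of how the settings above might be combined with the code from the question. The helper names `build_table_settings` and `extract_schematic_tables` are hypothetical and only used for illustration. The sketch assumes the tables are on the first page and that the file is named as in the question.

```python
def build_table_settings(page):
    """Build table settings that treat every curve and edge on the page as an explicit line."""
    # page.curves and page.edges are lists of object dicts exposed by pdfplumber pages.
    candidate_lines = page.curves + page.edges
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": candidate_lines,
        "explicit_horizontal_lines": candidate_lines,
        "intersection_tolerance": 15,
    }


def extract_schematic_tables(pdf_path, page_index=0):
    """Open the PDF and extract the tables of one page using the explicit-line settings."""
    # Imported here so the settings helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(build_table_settings(page))


# Example usage (assumes pdfplumber is installed and the PDF from the question is present):
# for table in extract_schematic_tables("Schematic.pdf"):
#     for row in table:
#         print(row)
```

The value 15 for intersection_tolerance is taken from the settings above. Other schematics may need a different tolerance.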

python

pdfplumber