Can't fetch only the names from a table located in a pdf file from a webpage

Question

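I've written a script in python, using the requests module together with the PyPDF2 library, to parse the content of a pdf file from a website. I'm only interested in the names in column A, under Facility Name, on page 4 of that pdf file (the table content). My script can grab the content of that page, but I can't find any way to get only the names and nothing else.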

pdf file link that I've used within the script

This is what the table looks like

I'm only interested in the names under the header Facility Name.

I've tried with:

import io
import PyPDF2
import requests

URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'

res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(3).extractText()
print(contents)

The output I'm currently getting:

Facilit
y Name
Address
City
State
Zip
Phone 
Number
Months as an 
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
and so on -----

The output I'd like to have:

Willows Center
Winter Park Care & Rehabilitation Center
Pinehill Nursing Center
River Brook Healthcare Center

How can I get only the names available in the table from that pdf file?

Answer 1

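Unfortunately, PDF is not a structured document format. It is just strings and images placed at coordinates, so that the page looks exactly the same as when it was created, whichever program renders it. This means you can't simply parse it the way you would parse HTML: the table does not live under a <table> element, its pieces are scattered across the page.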
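Because the extracted text is only a flat sequence of strings, any attempt to recover the name column from it has to rely on heuristics. Below is a minimal sketch that works on the sample output shown in the question. It assumes that every record ends with a fused state/ZIP/phone line such as CA95988530-934-2834, and that the name is always three lines above it (name, street address, city). The function name facility_names and both regular expressions are illustrative, not part of PyPDF2.

```python
import re

# Assumption: each record ends with a fused "state + ZIP + phone" line,
# e.g. "CA95988530-934-2834", as in the output shown in the question.
STATE_ZIP_PHONE = re.compile(r"^[A-Z]{2}\d{5}\d{3}-\d{3}-\d{4}$")

# Assumption: the name line is prefixed either by the tail of the header
# ("SFF") or by the previous row's "Months as an SFF" number.
LEADING_JUNK = re.compile(r"^(?:SFF|\d+)")


def facility_names(page_text):
    """Heuristically pull facility names out of the extracted page text."""
    lines = [line.strip() for line in page_text.splitlines()]
    names = []
    for i, line in enumerate(lines):
        # Layout assumed per record: name, street address, city,
        # then the state/ZIP/phone line, so the name is 3 lines up.
        if i >= 3 and STATE_ZIP_PHONE.match(line):
            names.append(LEADING_JUNK.sub("", lines[i - 3]).strip())
    return names


# Sample taken from the output in the question.
sample = """Months as an
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
"""

for name in facility_names(sample):
    print(name)
```

This is fragile by design: a street address that wraps onto two lines, a missing phone number, or a facility name that itself starts with a digit would all break it, which is exactly the problem described above.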

See:

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
How to extract data from a PDF file while keeping track of its structure?

Take a look at https://github.com/atlanhq/camelot, it might help you.
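A rough sketch of how camelot could be used here, assuming the pdf has already been saved locally, that the package is installed (it is published on PyPI as camelot-py), and that camelot detects the table on page 4 with the name column first. None of this was verified against the actual file. The helper names first_column_names and names_from_pdf are made up for illustration.

```python
def first_column_names(rows, header="Facility Name"):
    """Return the first cell of each row, skipping blanks and the header."""
    names = []
    for row in rows:
        if not row:
            continue
        # Collapse whitespace, since a cell may wrap across several lines.
        cell = " ".join(str(row[0]).split())
        if cell and cell.lower() != header.lower():
            names.append(cell)
    return names


def names_from_pdf(path, page="4"):
    """Read one page with camelot and return its first-column values."""
    # Imported lazily so the helper above works without camelot installed.
    import camelot

    # camelot page numbers are 1-based strings; "4" matches getPage(3).
    tables = camelot.read_pdf(path, pages=page)
    if len(tables) == 0:
        return []
    # Each table exposes a pandas DataFrame via .df; assumption: the
    # first detected table is the one with the facility names.
    return first_column_names(tables[0].df.values.tolist())
```

Calling names_from_pdf("SFFList.pdf") would then return a list of names if the table is detected. If nothing is found, passing flavor="stream" to camelot.read_pdf is worth trying, since the default parser expects tables with ruled lines.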

(There are at most 10 pages with tables in there, so doing it manually may be the faster option, unless you have a lot of PDFs like this.)

python

python-3.x

web-scraping

pypdf2