How to convert Web PDF to Text

Question

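I want to convert web PDFs (for example https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf) to text without saving them to my computer. Around 1000 such announcements appear every day, so I would like to extract their text without writing the files to disk. Is there a Python solution for this? Thanks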

Answer 1

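There are several ways to do this, but the simplest is to download the PDF locally and then extract the text with a Python module such as pdfplumber. Note that pdfplumber reads the PDF's embedded text layer rather than performing true OCR, so it only works on PDFs that contain selectable text.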

Here is a simple code example (using pdfplumber):

from urllib.request import urlopen

import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

# Download the PDF and write it to a local file (overwritten on every run)
with urlopen(url) as response, open('img.pdf', 'wb') as file:
    file.write(response.read())

text = None
try:
    # pdfplumber text extraction, first page only
    with pdfplumber.open('img.pdf') as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
except Exception:
    # Some files are not PDFs (these are annexes and we don't want them),
    # or the PDF could not be read (damaged?)
    print('Error. Are you sure this is a PDF?')

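Edit: Oops, I just realized you asked for "without saving it to my computer". That said, I have also scraped a lot of PDFs (1000+), but I save every one of them as "img.pdf", so they keep overwriting each other and only one PDF file is ever left on disk. I don't have a PDF OCR solution that avoids saving the file. Sorry :'(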
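
For what it's worth, here is a minimal sketch of how the same idea might work without touching the disk. It relies on pdfplumber.open accepting a file-like object as well as a path, so the downloaded bytes can be wrapped in io.BytesIO. The helper names looks_like_pdf and pdf_url_to_text are made up for illustration and are not part of pdfplumber, and the "%PDF-" signature check is only a rough guard against the non-PDF annexes mentioned above.

```python
from io import BytesIO
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the '%PDF-' signature appears near the start of the data."""
    return b'%PDF-' in data[:1024]


def pdf_url_to_text(url: str) -> str:
    """Download a PDF into memory and return the text of all its pages."""
    # Third-party dependency; imported here so looks_like_pdf works without it.
    import pdfplumber

    with urlopen(url) as response:
        data = response.read()  # the whole PDF is held in RAM, never on disk

    if not looks_like_pdf(data):
        raise ValueError(f'{url} does not look like a PDF')

    # Hand pdfplumber an in-memory file object instead of a path
    with pdfplumber.open(BytesIO(data)) as pdf:
        # extract_text() may give None or '' for pages without a text layer
        return '\n'.join(page.extract_text() or '' for page in pdf.pages)


if __name__ == '__main__':
    url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'
    print(pdf_url_to_text(url))
```

With around 1000 files a day, calling pdf_url_to_text in a loop and catching ValueError for the non-PDF links would probably be enough. Scanned PDFs with no text layer would still need a real OCR step, which this sketch does not cover.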

html

python

pdf

web-scraping

pdftotext