How to check if an image contains text?

Question

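Given any image of a scanned document, I want to check that it is not a blank page. I know I could send it to AWS Textract, but that costs money.

I know I can use pytesseract, but maybe there is a simpler, more elegant solution? Or, given an .html file that represents the text of the image, how do I check whether it shows a blank page?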

Answer 1

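We can use pytesseract for this by thresholding the image and passing the result to tesseract. However, if you have an .html file that represents the text of the image, you can use BeautifulSoup to extract the text from it and check whether it is empty. Still, that is a roundabout approach.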
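A minimal sketch of both checks from this answer, assuming OpenCV (cv2), pytesseract with a working Tesseract install, and beautifulsoup4 are available. The helper names (has_text, image_has_text, html_has_text) and the min_chars cutoff are illustrative assumptions, not part of any of these libraries.

```python
def has_text(extracted: str, min_chars: int = 1) -> bool:
    """Return True if the string holds at least min_chars non-whitespace characters."""
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible >= min_chars


def image_has_text(path: str, min_chars: int = 1) -> bool:
    """Threshold a scanned image with Otsu's method, OCR it, and test the result."""
    # Imported here so has_text stays usable without the OCR stack installed.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"could not read image: {path}")
    # Otsu picks the binarization threshold from the image histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return has_text(pytesseract.image_to_string(binary), min_chars)


def html_has_text(html: str, min_chars: int = 1) -> bool:
    """Strip the markup with BeautifulSoup and test what is left."""
    from bs4 import BeautifulSoup

    return has_text(BeautifulSoup(html, "html.parser").get_text(), min_chars)
```

OCR on an empty scan can still return a few stray characters, so raising min_chars to a handful of characters may give a more robust blank-page test than checking for a strictly empty string.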

Answer 2

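If you want to spare yourself the trouble of going through Pytesseract, PyMuPDF is another option. This is just an example of how to extract text from a scanned image or a cleanly formatted PDF. Note that PyMuPDF's text extraction reads an existing text layer and does not run OCR by default, so a scan without a text layer returns no text: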

import fitz  # PyMuPDF; older releases spell these methods pageCount, loadPage, getText

input_file = 'path/to/your/file'
doc = fitz.open(input_file)  # open the document with PyMuPDF
no_of_pages = doc.page_count  # number of pages in the document

for page_no in range(no_of_pages):
    page = doc.load_page(page_no)  # load one page by its 0-based number
    blocks = page.get_text("blocks")
    blocks.sort(key=lambda block: block[3])  # sort by 'y1' values

    for block in blocks:
        print(block[4])  # print the lines of this block or do your check here

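page.get_text(option) (getText in older releases) is probably your best bet, where option is a string that controls the output type. You can choose plain text, individual words with position information, HTML or XML string output, the complete page content as a Python dictionary, and more.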
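For the blank-page question, the "blocks" output can be reduced to a yes/no answer. A sketch, assuming the tuple layout PyMuPDF documents for "blocks" (x0, y0, x1, y1, text, block_no, block_type, with block_type 0 for text and 1 for images). The names blocks_have_text and blank_pages are hypothetical helpers, not PyMuPDF functions.

```python
def blocks_have_text(blocks) -> bool:
    """True if any text block (block_type 0) contains non-whitespace text."""
    return any(b[6] == 0 and b[4].strip() for b in blocks)


def blank_pages(path: str) -> list:
    """Return the 0-based numbers of pages without extractable text."""
    import fitz  # PyMuPDF; imported here so blocks_have_text works without it

    with fitz.open(path) as doc:
        return [
            page.number
            for page in doc
            if not blocks_have_text(page.get_text("blocks"))
        ]
```

A scanned page that carries no OCR text layer would be reported as blank by this check, so it suits PDFs that already contain text rather than raw scans.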

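Edit:

A quick way to handle a jpg is to convert it to a PDF first:

pdfbytes = doc.convert_to_pdf()  # convertToPDF in older PyMuPDF releases
pdf = fitz.open('pdf', pdfbytes)

If you don't want to convert it to a PDF, use page.get_text with the "dict" argument and filter the blocks. This creates a list of all the images on the page:

d = page.get_text("dict")
blocks = d["blocks"]
imgblocks = [b for b in blocks if b["type"] == 1]  # type 1 = image block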

If none of these meets your needs, the PIL library may be your next option. If you need more information, here is the official documentation for PyMuPDF, and here, since you mentioned HTML in another thread.
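If OCR is more than the task needs, a pixel-level check with PIL (Pillow) may be enough. This is a heuristic sketch and not something taken from the answer above: it treats a page as blank when almost no pixels are dark. The names dark_fraction and looks_blank, the darkness cutoff of 128, and the 0.1% ink limit are all assumptions to tune on your own scans.

```python
def dark_fraction(histogram, cutoff: int = 128) -> float:
    """Fraction of pixels darker than cutoff, given a 256-bin grayscale histogram."""
    total = sum(histogram)
    return sum(histogram[:cutoff]) / total if total else 0.0


def looks_blank(path: str, max_ink: float = 0.001) -> bool:
    """Heuristic: a page counts as blank if at most max_ink of its pixels are dark."""
    from PIL import Image  # Pillow; imported here so dark_fraction works without it

    with Image.open(path) as img:
        # Mode "L" is 8-bit grayscale, so histogram() returns 256 counts.
        return dark_fraction(img.convert("L").histogram()) <= max_ink
```

Speckle, punch holes, or scanner shadows add dark pixels to an otherwise empty page, so a noisy scanner may need a higher max_ink. The check also cannot tell text from any other kind of ink, such as a drawing or a stamp.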

python

ocr

opencv

python-tesseract