How to check if a PDF is a scanned image or contains text
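I have a large number of files. Some are scanned images saved as PDF, and some are full or partial text PDFs.
Is there a way to check these files so that we only process the scanned-image files and not the full or partial text PDFs?
Environment: Python 3.6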
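The code below works for both searchable and non-searchable PDFs: it extracts the text, and for a page that is only a scanned image (no text layer) the result is an empty string.

import fitz  # PyMuPDF

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    # get_text() was named getText() in older PyMuPDF releases
    text += page.get_text()

If you don't have the fitz module, install it with:

pip install --upgrade pymupdf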
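How about checking the PDF's '/Resources' entry?!
I believe any text in a PDF (an electronic document) is likely to come with a font, especially since the goal of PDF is to be a portable file format, so it keeps the font definitions.
If you are a PyPDF2 user, try:

import PyPDF2

# input_file_location (path to the PDF) and page_num (0-based page index)
# are assumed to be defined by the caller.
# PdfFileReader/getPage are the legacy PyPDF2 names
# (PdfReader and reader.pages[...] in PyPDF2 3.x).
pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)
page_resources = page_data["/Resources"]

if "/Font" in page_resources:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_resources.keys(),
    )
elif len(page_resources.get("/XObject", {})) != 1:
    print("[Info]: PDF Contains:", page_resources.keys())

# Report every XObject on the page that is an image
x_object = page_resources.get("/XObject", {})
for obj in x_object:
    obj_ = x_object[obj]
    if obj_["/Subtype"] == "/Image":
        print("[Info]: PDF is image only")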
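The same /Resources check can be wrapped in a small function so it is easier to reuse and test. This is only a sketch: the name `classify_resources` and its return labels are made up for illustration, and it accepts any dict-like object, such as the `page_data["/Resources"]` value above. It assumes the XObject entries are already resolved dictionaries.

```python
from typing import Any, Mapping


def classify_resources(resources: Mapping[str, Any]) -> str:
    """Guess a page's type from its /Resources dictionary.

    Hypothetical helper: returns "text" if a /Font entry is present,
    "image-only" if every XObject is an image, otherwise "unknown".
    """
    if "/Font" in resources:
        return "text"
    x_objects = resources.get("/XObject", {})
    subtypes = [x_objects[name].get("/Subtype") for name in x_objects]
    if subtypes and all(subtype == "/Image" for subtype in subtypes):
        return "image-only"
    return "unknown"


if __name__ == "__main__":
    scanned_page = {"/XObject": {"/Im0": {"/Subtype": "/Image"}}}
    text_page = {"/Font": {"/F1": {}}, "/ProcSet": ["/PDF", "/Text"]}
    print(classify_resources(scanned_page))  # image-only
    print(classify_resources(text_page))     # text
```

A scan that has been through OCR usually carries a font as well, so a "text" result here does not rule out a scanned document.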
Try OCRmyPDF.
You can use this command to convert a scanned PDF to a digital PDF:

ocrmypdf input_scanned.pdf output_digital.pdf

If the input PDF is already digital, the command fails with the error "PriorOcrFoundError: page already has text!".

import subprocess as sp
import re

# Run OCRmyPDF and capture everything it prints
output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!", output):
    print("Uploaded scanned pdf")
else:
    print("Uploaded digital pdf")
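The string check above can be separated from the subprocess call, which makes it testable without OCRmyPDF installed. A minimal sketch, assuming the error message quoted above appears in the command's output; `classify_from_ocrmypdf_output` and `run_ocrmypdf` are hypothetical names.

```python
import subprocess

# Marker taken from the error message quoted in the answer above
PRIOR_OCR_MARKER = "PriorOcrFoundError"


def classify_from_ocrmypdf_output(output: str) -> str:
    """Return "digital" if the output reports existing text, else "scanned"."""
    return "digital" if PRIOR_OCR_MARKER in output else "scanned"


def run_ocrmypdf(input_pdf: str, output_pdf: str) -> str:
    """Run ocrmypdf and return its combined stdout/stderr as text."""
    result = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],  # argument list avoids shell quoting
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return result.stdout


if __name__ == "__main__":
    sample = "PriorOcrFoundError: page already has text!"
    print(classify_from_ocrmypdf_output(sample))  # digital
```

As with the original snippet, any failure that does not print the marker (a missing input file, for instance) would be reported as "scanned", so checking the return code as well may be worthwhile.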
def get_pdf_searchable_pages(fname):
    # pip install pdfminer (or its maintained fork, pdfminer.six)
    from pdfminer.pdfpage import PDFPage

    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            # A page that declares a font is treated as searchable
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
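Building on an earlier answer, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
You need to install the fitz and PyMuPDF modules. You can do so with pip.
The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz before PyMuPDF, otherwise it produces a weird error about a missing frontend module (no idea why). So here is how I installed the modules:

pip3 install fitz
pip3 install PyMuPDF==1.16.14

Here is the Python 3 implementation: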
import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.
    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)
    for page in doc:
        total_page_area = total_page_area + abs(page.rect)  # abs(Rect) is its area
        text_area = 0.0
        # getTextBlocks() is the PyMuPDF 1.16 name; newer releases use get_text("blocks")
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()
    if total_page_area == 0:
        return 0.0  # no pages, so avoid dividing by zero
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")
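Although this answers your question (i.e. distinguishing fully scanned PDFs from full/partial text PDFs), this solution cannot distinguish full-text PDFs from scanned PDFs that also contain text. That is the case for scanned PDFs processed by OCR software such as pdfsandwich or Adobe Acrobat, which adds "invisible" text blocks on top of the image so that you can select the text.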
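The area-ratio arithmetic in get_text_percentage can be tried without a PDF at hand. The sketch below redoes it on plain (x0, y0, x1, y1) tuples; `box_area` and `text_ratio` are hypothetical names, and the boxes stand in for `page.rect` and the first four values of each text block.

```python
from typing import Iterable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_area(box: Box) -> float:
    """Area of a rectangle; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def text_ratio(pages: Iterable[Tuple[Box, Sequence[Box]]]) -> float:
    """Fraction of total page area covered by text blocks.

    pages yields (page_box, text_block_boxes) pairs. Overlapping blocks
    are counted more than once, as in the function above.
    """
    total_page_area = 0.0
    total_text_area = 0.0
    for page_box, blocks in pages:
        total_page_area += box_area(page_box)
        total_text_area += sum(box_area(block) for block in blocks)
    if total_page_area == 0:
        return 0.0  # no pages: avoid dividing by zero
    return total_text_area / total_page_area


if __name__ == "__main__":
    # One 100 x 200 page with a single 50 x 100 text block
    print(text_ratio([((0, 0, 100, 200), [(10, 10, 60, 110)])]))  # 0.25
```

With the 0.01 threshold used above, this page would count as containing text; a page with no blocks gives 0.0 and would count as fully scanned.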
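I created a script to detect whether a PDF was OCRd. The main idea: in OCRd PDFs, the text is invisible.
Algorithm to test whether a given PDF (f1) was OCRd:
- create a copy of f1, denoted f2
- delete all text on f2
- create images (PNG) for all (or just a few) pages of f1 and f2
If all the images of f1 and f2 are identical, f1 was OCRd.
https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh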
#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p] file
#
# Exit 0: Yes, file is a scanned PDF
# Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case "$1" in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

# increment to make it easier with page numbering
# NOTE: post-increment yields the old value, so max_pages is unchanged here
max_pages=$((max_pages++))

# abort if the given command is not on the PATH
command_exists() {
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

orig=$PWD
num_pages=$(pdfinfo "$1" | grep Pages | awk '{print $2}')

echo $num_pages
echo $max_pages

if ((($max_pages > 1) && ($max_pages < $num_pages))); then
  num_pages=$max_pages
fi

cd "$(mktemp -d)"

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
# "$1" is resolved relative to the temp directory, so pass an absolute path
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages "$1" &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels, if 0 there are the same pictures
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done
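The decision rule at the end of the script can be written in a few lines of Python. This sketch assumes the pages of f1 and f2 have already been rendered to raw pixel data (for example from the PNGs produced by the gs and mutool steps); `is_ocrd` and its arguments are hypothetical names, and the byte-for-byte comparison stands in for `compare -metric AE` reporting zero differing pixels.

```python
from typing import Sequence


def is_ocrd(original_pages: Sequence[bytes], stripped_pages: Sequence[bytes]) -> bool:
    """True if every page looks the same with and without its text layer.

    original_pages: rendered pixels of f1, one bytes object per page.
    stripped_pages: rendered pixels of f2 (f1 with all text removed).
    """
    if not original_pages or len(original_pages) != len(stripped_pages):
        return False  # nothing to compare, or page counts disagree
    return all(a == b for a, b in zip(original_pages, stripped_pages))


if __name__ == "__main__":
    scan = [b"\x00\x01\x02", b"\x03\x04\x05"]
    print(is_ocrd(scan, list(scan)))                 # True: removing text changed nothing
    print(is_ocrd(scan, [scan[0], b"\x03\x04\xff"]))  # False: page 2 differs
```

Removing visible text changes the rendered page, so a digitally created PDF fails the comparison, while invisible OCR text leaves the image untouched.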
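You can use pdfplumber. If the code below prints "None", it is a scanned PDF; otherwise it is searchable.

pip install pdfplumber

import pdfplumber

# file_name is the path to the PDF to check
with pdfplumber.open(file_name) as pdf:
    page = pdf.pages[0]  # only the first page is checked
    text = page.extract_text()
    print(text)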
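Comparing against None alone can be fragile, since an extraction may also yield an empty or whitespace-only string. Here is a small sketch of a more forgiving check; `has_text` is a hypothetical helper, meant to be fed the result of the `page.extract_text()` call used above.

```python
from typing import Optional


def has_text(extracted: Optional[str]) -> bool:
    """True if the extraction result contains any non-whitespace character."""
    return bool(extracted and extracted.strip())


if __name__ == "__main__":
    print(has_text(None))         # False
    print(has_text("  \n"))       # False
    print(has_text("Invoice 42"))  # True
```

Calling `has_text(page.extract_text())` for every page in `pdf.pages`, rather than only the first, would also catch partially scanned files.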
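To extract text from a scanned PDF, you can use OCRmyPDF. It is a very simple package and a one-line solution. You can find more about the package here and a video explaining an example here. If this helps, please upvote the answer. Good luck!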
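You can use ocrmypdf; it has a parameter to skip pages that already contain text.
More information here: https://ocrmypdf.readthedocs.io/en/latest/advanced.html

import ocrmypdf

# file_path, save_path and language are assumed to be defined by the caller
ocrmypdf.ocr(file_path, save_path, rotate_pages=True, remove_background=False, language=language, deskew=False, force_ocr=False, skip_text=True)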
I just modified the code from @Vikas Goel, but in rare cases it does not give decent results.

def get_pdf_searchable_pages(fname):
    """Identify a digitally created PDF or a scanned PDF."""
    from pdfminer.pdfpage import PDFPage

    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    # Check for an empty document first, otherwise 0 == 0 reports it as searchable
    if page_num == 0:
        return "Not a valid document"
    if page_num == len(searchable_pages):
        return "searchable_pages"
    return "non_searchable_pages"
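None of the posted answers worked for me. Unfortunately, those solutions often detect scanned PDFs as text PDFs, most often because of the media boxes present in the documents.
As ironic as it may seem, the following code turned out to be more accurate for my use case:

import fitz  # PyMuPDF

# path is the PDF file to check; get_text() was named getText() in older PyMuPDF releases
extracted_text = ''.join([page.get_text() for page in fitz.open(path)])
doc_type = "text" if extracted_text else "scan"

Make sure to install fitz and PyMuPDF beforehand:

pip install fitz PyMuPDF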
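Here is another version using PyMuPDF that reports whether the PDF contains only images or something else:

import fitz  # PyMuPDF

my_pdf = r"C:\Users\Test\FileName.pdf"
doc = fitz.open(my_pdf)


def pdftype(doc):
    i = 0  # number of pages that contain text
    for page in doc:
        # get_text() was named getText() in older PyMuPDF releases
        if len(page.get_text()) > 0:  # for a scanned page it will be 0
            i += 1
    if i > 0:
        print('full/partial text PDF file')
    else:
        print('only scanned images in PDF file')


pdftype(doc)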
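pdftype only distinguishes "some text" from "no text". If partially scanned files need their own bucket, the per-page results can be kept apart. This is a sketch, assuming the list of page texts comes from a page loop like the one above; `classify_pdf` and its labels are made up for illustration.

```python
from typing import Iterable


def classify_pdf(page_texts: Iterable[str]) -> str:
    """Return "text", "scanned", "mixed" or "empty" from per-page extracted text."""
    # A page counts as text only if it has at least one non-whitespace character
    flags = [bool(text.strip()) for text in page_texts]
    if not flags:
        return "empty"    # document has no pages
    if all(flags):
        return "text"     # every page has extractable text
    if not any(flags):
        return "scanned"  # no page has extractable text
    return "mixed"        # some pages have text, some do not


if __name__ == "__main__":
    print(classify_pdf(["", "  "]))          # scanned
    print(classify_pdf(["Chapter 1", ""]))   # mixed
    print(classify_pdf(["Chapter 1", "2"]))  # text
```

Only files classified as "scanned" (and possibly "mixed") would then be passed on to OCR.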